Big Data Flow and Assurance Checkpoints

When implementing a big data testing strategy, it is important to understand the big data flow and assurance checkpoints.
Data coming from heterogeneous sources, such as Excel files, flat files, fixed-length files, XML, JSON and BSON, binary files, MP4, Flash files, WAV, PDF files, Word documents, HTML files and so on, needs to be dumped into big data stores so that it can be processed further and turned into meaningful information. Before moving these data files into the big data stores, the preferred first-level assurance checkpoint is to verify the source file metadata and checksums.
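As a first-level check, a lightweight script can recompute checksums and compare file sizes against what the source system published. The sketch below is illustrative only: the manifest file name, its columns and the choice of MD5 are assumptions, not part of any specific project's process.

```python
import csv
import hashlib
import os

def md5_of(path, chunk_size=8 * 1024 * 1024):
    """Compute an MD5 checksum without loading the whole file into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_source_files(manifest_path):
    """Compare each source file's size and checksum against a manifest.

    The manifest is assumed (for illustration) to be a CSV with columns:
    path, expected_size_bytes, expected_md5.
    """
    failures = []
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle):
            path = row["path"]
            if not os.path.exists(path):
                failures.append((path, "missing file"))
                continue
            if os.path.getsize(path) != int(row["expected_size_bytes"]):
                failures.append((path, "size mismatch"))
            if md5_of(path) != row["expected_md5"]:
                failures.append((path, "checksum mismatch"))
    return failures

if __name__ == "__main__":
    # "manifest.csv" is a placeholder name for the source system's file list.
    for path, reason in verify_source_files("manifest.csv"):
        print(f"FAIL: {path} -> {reason}")
```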
Once the heterogeneous source files have been dumped into the big data stores, pre-big-data-processing validation should verify that the files were loaded according to the dumping rules and loaded completely, with no data discrepancies in the form of extra, duplicate or missing data.
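One way to verify the dump is to reconcile record keys between the source extract and the data read back from the big data store. The sketch below assumes pipe-delimited files with the key in the first field; the file names are placeholders, and in practice the target side would usually be read back from HDFS rather than a local file.

```python
from collections import Counter

def key_counts(lines, key_field=0, delimiter="|"):
    """Count occurrences of each record key (first field by default)."""
    counts = Counter()
    for line in lines:
        counts[line.rstrip("\n").split(delimiter)[key_field]] += 1
    return counts

def reconcile(source_lines, target_lines):
    """Report missing, extra and duplicated record keys after the dump."""
    src, tgt = key_counts(source_lines), key_counts(target_lines)
    return {
        "missing": sorted(set(src) - set(tgt)),
        "extra": sorted(set(tgt) - set(src)),
        "duplicates": sorted(k for k, n in tgt.items() if n > 1),
    }

if __name__ == "__main__":
    # Illustrative paths; the target side might be produced by something
    # like `hdfs dfs -cat` rather than read from a local file.
    with open("source_extract.psv") as s, open("hdfs_dump_readback.psv") as t:
        print(reconcile(s, t))
```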
Once data dumping is complete, an initial round of data profiling and cleansing is performed, followed by execution of the actual functional algorithms in distributed mode. Testing of the cleansing rules and of the functional algorithms forms the main assurance checkpoint at this layer.
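Cleansing rules lend themselves to unit tests. The sketch below shows a pytest-style test of a hypothetical cleansing rule (trimming whitespace, normalising blanks and rejecting records without a customer_id); the rule and the field names are illustrative only.

```python
def cleanse(record):
    """Hypothetical cleansing rule: trim whitespace, normalise empty strings
    to None, and reject records that are missing a customer_id."""
    cleaned = {k: (v.strip() or None) if isinstance(v, str) else v
               for k, v in record.items()}
    if not cleaned.get("customer_id"):
        return None  # rejected record
    return cleaned

def test_cleanse_trims_and_rejects():
    # Whitespace is trimmed and the record is kept.
    assert cleanse({"customer_id": " 42 ", "name": " Ada "}) == {
        "customer_id": "42", "name": "Ada"}
    # A blank mandatory field causes the record to be rejected.
    assert cleanse({"customer_id": "  ", "name": "no id"}) is None
```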
After the data processing algorithms have run, the refined, high-value output is passed to downstream systems such as the enterprise data warehouse (for historical data analysis) and reporting systems. Here, report testing and ETL testing act as the assurance checkpoints.
All of these high‑level assurance checkpoints will ensure:
  • Data completeness – wherein end‑to‑end data validation among heterogeneous big data sources is tested.
  • Data transformation – wherein structured and unstructured data validations are performed based on business rules.
  • Data quality – wherein rejected, ignored and invalid data is identified (a small validation sketch follows this list).
  • Performance and scalability – wherein scalable technical architecture used for data processing is tested.
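For the data quality checkpoint, a simple rule-driven partitioner can separate valid records from rejected ones and report which rule each rejected record failed. The field names and rules below are purely illustrative stand-ins for a project's real data quality specification.

```python
import re

def non_negative_number(value):
    """Return True when the value parses as a number >= 0."""
    try:
        return float(value) >= 0
    except (TypeError, ValueError):
        return False

# Illustrative business rules; real rules come from the project's
# data quality specification.
RULES = {
    "customer_id": lambda v: bool(v) and str(v).isdigit(),
    "email": lambda v: bool(re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", v or "")),
    "amount": non_negative_number,
}

def split_valid_invalid(records):
    """Partition records into valid and invalid, noting which rules failed."""
    valid, invalid = [], []
    for record in records:
        failed = [field for field, rule in RULES.items()
                  if not rule(record.get(field))]
        if failed:
            invalid.append((record, failed))
        else:
            valid.append(record)
    return valid, invalid

if __name__ == "__main__":
    sample = [
        {"customer_id": "101", "email": "a@b.com", "amount": "10.5"},
        {"customer_id": "x1", "email": "not-an-email", "amount": "-3"},
    ]
    valid, invalid = split_valid_invalid(sample)
    print(f"{len(valid)} valid, {len(invalid)} invalid")
    for record, failed in invalid:
        print("rejected:", record, "failed rules:", failed)
```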

Traditional Data Processing versus Big Data Processing

Discussions about the differences between traditional data processing and big data processing models raised questions about tools and technology, skill sets, processes and templates, and cost.
When considering the big data characteristics of volume, velocity and variety, the roundtable sessions highlighted the following challenges and recommended approaches for dealing with them:
  • 100% test coverage versus adequate test coverage: focus on achieving adequate test coverage by prioritising projects and situations.
  • Setting up a production-like environment for performance testing: the recommendation is to simulate and stub components that are not ready or not available during the testing phase (a small stubbing sketch follows this list).
  • High scripting effort while dealing with a variety of data: this requires strong collaboration among the testing, development, IT and other teams, along with strong resource skill sets.
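As a sketch of the stubbing recommendation above, Python's unittest.mock can stand in for a component that is not yet available, so the surrounding flow can still be exercised. The EDW connector function here is hypothetical.

```python
from unittest.mock import patch

def publish_to_edw(batch):
    """Hypothetical downstream component (e.g. an EDW connector) that is
    not available in the test environment."""
    raise NotImplementedError("EDW connector not available in this environment")

def process_and_publish(batch):
    """Toy pipeline step: filter the batch, then hand it to the EDW connector."""
    filtered = [row for row in batch if row.get("amount", 0) > 0]
    publish_to_edw(filtered)
    return len(filtered)

def test_pipeline_with_stubbed_edw():
    batch = [{"amount": 5}, {"amount": -1}, {"amount": 2}]
    # Stub the unavailable component so the rest of the flow can be tested.
    with patch(f"{__name__}.publish_to_edw") as fake_publish:
        assert process_and_publish(batch) == 2
        fake_publish.assert_called_once()
```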

Key Components of a Big Data Testing Strategy

The recommended big data testing strategy should include the following:
  • Big test data management.
  • Big data validation.
  • Big data profiling.
  • Big data security testing.
  • Failover testing.
  • Big data environment testing.
  • Big data performance testing.

Sample strategy:
Testing big data is one of the biggest challenges organizations face, largely because of a lack of understanding of what to test and how much data to test. Organizations have struggled to define test strategies for structured and unstructured data validation, to set up test environments, to work with non-relational databases and to perform non-functional testing.

Big Data Testing Approach

Testing needs to be performed at each of the three phases of big data processing to ensure that data is processed without errors. Functional testing includes (i) validation of pre-Hadoop processing, (ii) validation of the Hadoop MapReduce output, and (iii) validation of the data extract and load into the EDW. In addition to this functional testing, non-functional testing such as performance and failover testing needs to be performed.
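A minimal example of the third check, validating the extract and load into the EDW, is a row-count reconciliation between the Hadoop output extract and the loaded warehouse table. In the sketch below, sqlite3 merely stands in for the EDW, and the file, database and table names are assumptions.

```python
import sqlite3

def count_extract_rows(extract_path):
    """Count data rows in the flat-file extract produced by the Hadoop job."""
    with open(extract_path) as handle:
        return sum(1 for line in handle if line.strip())

def count_edw_rows(connection, table):
    """Count rows loaded into the warehouse table."""
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

if __name__ == "__main__":
    # sqlite3 stands in for the EDW purely for illustration; a real check
    # would use the warehouse's own driver (e.g. JDBC/ODBC).
    conn = sqlite3.connect("warehouse_standin.db")
    extract_count = count_extract_rows("hadoop_output_extract.csv")
    loaded_count = count_edw_rows(conn, "sales_fact")
    assert extract_count == loaded_count, (
        f"row count mismatch: extract={extract_count}, EDW={loaded_count}")
    print("extract-to-EDW row counts reconcile:", loaded_count)
```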
The figure below shows a typical big data architecture and highlights the areas where testing should be focused.
[Figure: big data testing focus areas]

Volume, Variety and Velocity: How to Test?

During these phases of big data processing, the three dimensions and characteristics of big data, i.e. Volume, Variety and Velocity, are validated to ensure there are no data quality defects or performance issues.
Volume: The amount of data created both inside and outside the organization. Huge volumes of data flow in from multiple systems and need to be processed and analyzed. To reduce execution time, all of the comparison scripts can be run in parallel on multiple nodes (see the sketch below).
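A minimal sketch of parallelising the comparison work is shown below, using a process pool on a single host; distributing the same comparisons across cluster nodes follows the same pattern with a cluster scheduler. The file paths are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
import filecmp

# Illustrative (source, target) file pairs; in practice these might be
# staged locally from HDFS before comparison, or the comparison might be
# pushed down to the cluster itself.
FILE_PAIRS = [
    ("source/part-00000.csv", "target/part-00000.csv"),
    ("source/part-00001.csv", "target/part-00001.csv"),
]

def compare_pair(pair):
    source, target = pair
    # shallow=False forces a byte-by-byte comparison rather than a stat check.
    return source, filecmp.cmp(source, target, shallow=False)

if __name__ == "__main__":
    # Spread the comparisons across worker processes to cut elapsed time.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for source, matched in pool.map(compare_pair, FILE_PAIRS):
            print(f"{source}: {'OK' if matched else 'MISMATCH'}")
```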
Variety: The variety of data is increasing, mostly unstructured text-based data and semi-structured data such as social media data.
Velocity: The speed at which new data is created, and the need for real-time analytics to derive business value from it, keeps increasing thanks to the digitization of transactions, mobile computing and the sheer number of mobile and internet users.
[Figure: approach for high-volume data testing]

Big Data Testing Types

Performance Testing: Any big data project involves processing large volumes of structured and unstructured data across multiple nodes in order to complete the job in less time. Poor architecture and badly designed code degrade performance and cause SLAs to be missed, which makes setting up Hadoop and other big data technologies a wasted effort.
Areas where performance issues can occur include:
  • Imbalance in input splits
  • Redundant shuffle and sorts
  • Aggregation computations performed in the reduce phase that could instead be done in the map phase, for example with a combiner or in-mapper combining (sketched after this list).
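One common remedy for the last point is in-mapper combining, where the mapper emits partial aggregates instead of raw records, cutting the volume of data shuffled to the reducers. The sketch below is a Hadoop Streaming style mapper; the tab-separated input layout is an assumption.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper with in-mapper combining (illustrative).

Assumes tab-separated input where the first field is the aggregation key and
the second is a numeric value; emits one partial sum per key instead of one
record per input line, reducing shuffle volume.
"""
import sys
from collections import defaultdict

def main():
    partial_sums = defaultdict(float)
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed records
        try:
            partial_sums[fields[0]] += float(fields[1])
        except ValueError:
            continue  # skip non-numeric values
    # Emit already-aggregated (key, partial sum) pairs for the reducers.
    for key, total in partial_sums.items():
        print(f"{key}\t{total}")

if __name__ == "__main__":
    main()
```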
Failover Testing: The Hadoop architecture consists of name nodes and hundreds of data nodes hosted on several server machines, all connected to one another. There is a real chance of node failures and of some HDFS components becoming non-functional.
Failover testing is therefore an important focus area in any big data implementation, with the objective of validating the recovery process and ensuring that data processing continues smoothly when it switches over to other data nodes.
Validations to be performed during failover testing include:
  • Validating that checkpoints of the edit logs and the FsImage of the name nodes happen at the defined time intervals.
  • Recovery of the edit logs and FsImage files of the name nodes.
  • No data corruption because of a name node failure.
  • Data recovery when a data node fails.
  • Validating that replication is initiated when one of the data nodes fails or data becomes corrupted (a scripted check along these lines is sketched below).
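Parts of these checks can be scripted against the Hadoop command line. The sketch below runs `hdfs fsck` and extracts the under-replicated block count from its summary; the exact summary wording can vary between Hadoop versions, so treat the parsing as an assumption.

```python
import re
import subprocess

def under_replicated_blocks(hdfs_path="/"):
    """Run `hdfs fsck` and pull the under-replicated block count from its
    summary. Assumes the `hdfs` CLI is on the PATH of the test host."""
    output = subprocess.run(
        ["hdfs", "fsck", hdfs_path],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Under-replicated blocks:\s+(\d+)", output)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    count = under_replicated_blocks("/")
    # After a simulated data node failure, this should return to zero once
    # the name node has re-replicated the affected blocks.
    print("under-replicated blocks:", count)
```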
