TESTING IN BIG DATA APPLICATIONS
As more organizations adopt “Big Data” as their data analytics solution, they are finding it difficult to define a robust testing strategy and to set up an optimal test environment for Big Data. This is mostly due to a lack of knowledge and understanding of Big Data testing, as the technology is still gaining popularity in the industry. Big Data involves processing huge volumes of structured and unstructured data across different nodes using technologies such as “MapReduce”, “Hive” and “Pig”. A robust testing strategy needs to be defined well in advance to ensure that the functional and non-functional requirements are met and that the data conforms to acceptable quality. In this blog we define recommended test approaches for testing “Hadoop”-based applications.
Traditional testing approaches on Hadoop are based upon sample data record sets, which is fine for unit testing activities. However, the challenge comes in determining how to validate an entire data set consisting of millions, and even billions, of records.
The diagram below describes a typical Big Data implementation design.
In order to successfully test a Big Data Analytics application, the test strategy should include the following testing considerations at a minimum.
Data Staging Validation
Data from various source systems such as RDBMS, social media and web logs should be validated to ensure that the correct data is pulled into the Hadoop system. Some of the high-level validations to be performed are:
- Compare the source data with the data landed on the Hadoop system to ensure they match
- Verify that the right data is extracted and loaded into the correct HDFS location
Some teams verify only sample sets of data using a sampling algorithm, because full dataset verification is difficult to achieve. However, this approach may not uncover all data inconsistencies and can leave data quality issues within HDFS undetected. It is therefore very important to include full dataset validation in the test strategy. Tools such as Datameer, Talend or Informatica can be used for validating the staged data. Import jobs should be created to pull the data into these tools from both the source and staging systems, and the data should then be compared using the data analytics capabilities of these tools. The diagram below describes the overall approach for staging validations.
Fig 2: Data Staging Validation
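As an illustration of how full dataset reconciliation can be scripted when a commercial tool is not available, the hedged HQL sketch below compares a source extract that has also been imported into Hive (for example via Sqoop) with the staged table. The table and column names (src_customer, stg_customer, customer_id, customer_name, email) are placeholders, not part of the original design.

```sql
-- Hedged sketch: full-dataset reconciliation between an imported source
-- extract (src_customer) and the staged Hadoop copy (stg_customer).
-- All table and column names are illustrative placeholders.

-- 1. Row counts on both sides must match.
SELECT 'source'  AS side, COUNT(*) AS row_cnt FROM src_customer
UNION ALL
SELECT 'staging' AS side, COUNT(*) AS row_cnt FROM stg_customer;

-- 2. Records present in the source but missing or altered in staging.
SELECT s.customer_id
FROM   src_customer s
LEFT JOIN stg_customer t
       ON  s.customer_id   = t.customer_id
       AND s.customer_name = t.customer_name
       AND s.email         = t.email
WHERE  t.customer_id IS NULL;
```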
Transformation or “MapReduce” Validation
This type of validation is similar to data warehouse testing, wherein a tester verifies that the business rules are applied to the data. In this case, however, the test approach differs slightly because Hadoop data should be tested for volume, variety and velocity.
DWH testing typically involves gigabytes of data, whereas Hadoop testing involves petabytes. Sampling techniques are a well-established way to test a DWH, but they do not carry over to a Hadoop application: even sampling is challenging in a Hadoop framework, and the sheer number of value combinations in such large volumes of data renders sampling ineffective as a validation approach.
DWH systems can process only structured data, whereas Hadoop systems can handle both structured and unstructured data with limited additional effort. This capability is already leading to new ways of exploring data, which in turn increases the number of scenarios to be covered by Hadoop testing.
The key validations to be performed are:
- Verification of the ETL logic implemented on the data
- Verification of the data aggregation/segregation rules implemented on the data
- Verification of the output data, validating that the processed data remains consistent even when executed in a distributed environment
- Verification of the batch processes designed for data transformation
Hive is the most reliable language for performing this validation. Testers should write HQL queries that replicate the data requirements and compare the results with the output produced by the development team's MapReduce jobs. If no discrepancies are found in the comparison, the test script is considered passed.
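For example, if the business rule aggregates transaction amounts per customer, the tester's independent HQL might look like the hedged sketch below, compared against the table written by the MapReduce job. The table names (stg_transactions, mr_customer_agg) and their columns are illustrative assumptions, not from the original post.

```sql
-- Hedged sketch: re-implement the aggregation rule in HQL and compare it
-- with the output table produced by the MapReduce job.
-- stg_transactions (input) and mr_customer_agg (MR output) are placeholders.

WITH expected AS (                -- tester's independent implementation of the rule
  SELECT customer_id,
         SUM(amount) AS total_amount,
         COUNT(*)    AS txn_count
  FROM   stg_transactions
  GROUP  BY customer_id
)
SELECT COALESCE(e.customer_id, a.customer_id) AS customer_id,
       e.total_amount AS expected_amount,
       a.total_amount AS actual_amount
FROM   expected e
FULL OUTER JOIN mr_customer_agg a
       ON e.customer_id = a.customer_id
WHERE  e.customer_id IS NULL              -- extra rows produced by the MR job
   OR  a.customer_id IS NULL              -- rows missed by the MR job
   OR  e.total_amount <> a.total_amount   -- aggregation rule applied incorrectly
   OR  e.txn_count    <> a.txn_count;
-- An empty result set means the MR output matches the tester's expectation.
```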
Fig 3: Transformation Validation
Data Warehouse Validation
This testing is performed after the data processed in the Hadoop environment has been loaded into the Enterprise Data Warehouse (EDW). The high-level scenarios to be tested include:
- Verify that the processed data is moved correctly from HDFS to the EDW tables
- Verify that the EDW data requirements are met
- Verify that the data is aggregated as per the specified requirements
Data warehouse validation is similar to data staging validation. Tools such as Datameer, Talend or Informatica can be used to validate the data loads from Hadoop into the traditional data warehouse.
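Where such a tool is not available, the same reconciliation can be scripted as paired aggregate queries, one executed on Hive and an equivalent one on the EDW, with the two result sets compared afterwards. The table and column names below are illustrative assumptions.

```sql
-- Hedged sketch: aggregate reconciliation between the Hive output table and
-- the corresponding EDW table. The first query runs on Hive, the second on
-- the EDW; the result sets are then compared by the tester or a tool.
-- Table and column names are illustrative placeholders.

-- Run on Hive:
SELECT load_date,
       COUNT(*)       AS row_cnt,
       SUM(sales_amt) AS total_sales
FROM   hive_sales_summary
GROUP  BY load_date;

-- Run on the EDW (standard SQL):
SELECT load_date,
       COUNT(*)       AS row_cnt,
       SUM(sales_amt) AS total_sales
FROM   edw_sales_summary
GROUP  BY load_date;
```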
Architecture Testing
As Hadoop involves processing large volumes of data, architecture testing is critical to the success of the application. A poorly designed system may suffer performance degradation and fail to meet the SLAs contracted with the business. At a minimum, performance and failover testing should be performed in a Hadoop environment.
Performance testing should be conducted with large volumes of data in an environment similar to production. Metrics such as job completion time, data throughput, memory utilization and similar system-level metrics should be verified as part of performance testing.
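When production-size extracts are not available, one way to build such volumes is to inflate a staged table directly in HQL. The hedged sketch below multiplies a table tenfold; the table names (stg_customer_txn, perf_customer_txn) and the 10x factor are illustrative placeholders, and it assumes duplicated records are acceptable for the performance scenario being measured.

```sql
-- Hedged sketch: inflate a staged table to a production-like volume for
-- performance runs. Table names and the 10x factor are placeholders.

-- Small helper table holding the replication factor (10 rows = 10x volume).
CREATE TABLE volume_multiplier (n INT);
INSERT INTO TABLE volume_multiplier VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

-- Build the performance dataset by cross joining the staged data with the
-- multiplier table.
CREATE TABLE perf_customer_txn AS
SELECT t.*
FROM   stg_customer_txn t
CROSS JOIN volume_multiplier m;
```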
Failover testing should be performed because a Hadoop cluster consists of a NameNode and several DataNodes hosted on different machines. The objective of failover testing is to verify that data processing continues seamlessly when DataNodes fail.