QASource Quarterly Expert Series

Big Data Testing Challenges and Solutions: TechnoCast - Spring 2017

Big Data is a problem statement that can be described by the four V's shown in the image below:

[Image: The Four V's of Big Data]

Solution to Big Data

Big Data can be analyzed for insights that lead to better decisions and strategic business moves. Two widely used solutions for analyzing Big Data are described below:

Apache Hadoop

  • Apache Hadoop is a software framework for the distributed processing of large datasets across large clusters of computers.

    The Hadoop framework has two main components:

    • HDFS (Hadoop Distributed File System)
    • Execution Engine (MapReduce)
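To make the MapReduce execution model concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps for a word count. It is plain Python rather than the Hadoop API; the function names and the sample lines are assumptions made for illustration only.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, each block of input would be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Illustrative input only.
sample_lines = ["big data needs testing", "testing big data is hard"]
counts = word_count(sample_lines)
print(counts)
```

In Hadoop itself, the framework performs the shuffle between the map and reduce tasks, and the input and output live in HDFS rather than in Python lists.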
Apache Spark

  • Apache Spark is an open source engine. It is a fast, expressive cluster computing system that is compatible with Apache Hadoop and works with any Hadoop-supported storage system.

    Framework Components:

    • Processing Engine: instead of just “map” and “reduce”, it defines a large set of operations
    • Key Construct: Resilient Distributed Dataset (RDD)
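The idea behind the RDD can be sketched with a toy class: transformations are only recorded, and an action replays them from the source. This is not the Spark API. `ToyRDD` and its methods are hypothetical names, and everything runs on one machine, so treat it as an illustration of the concept only.

```python
class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that produces the base records
        self._lineage = lineage    # recorded transformations, applied lazily

    def _extend(self, step):
        # Transformations return a new ToyRDD; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (step,))

    def map(self, fn):
        return self._extend(lambda records: (fn(r) for r in records))

    def filter(self, predicate):
        return self._extend(lambda records: (r for r in records if predicate(r)))

    def flat_map(self, fn):
        return self._extend(lambda records: (x for r in records for x in fn(r)))

    def collect(self):
        # Action: replay the recorded lineage from the source. Replaying the
        # lineage is also how lost data could be rebuilt after a failure.
        records = self._source()
        for step in self._lineage:
            records = step(records)
        return list(records)


# Illustrative input only.
lines = ToyRDD(lambda: ["error disk full", "info ok", "error timeout"])
error_words = (
    lines.filter(lambda line: line.startswith("error"))
         .flat_map(lambda line: line.split()[1:])
         .map(str.upper)
)
result = error_words.collect()
print(result)
```

Chaining `filter`, `flat_map` and `map` here stands in for the larger set of operations that Spark offers beyond “map” and “reduce”.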

Hadoop MapReduce vs Spark

Who Wins the Battle?

Each item below compares Hadoop MapReduce (MR) with Spark:

  • Hadoop MR: MapReduce is difficult to program and needs abstractions on top of it.
    Spark: Spark is easy to program and does not need an abstraction layer.

  • Interactive Mode
    Hadoop MR: Hadoop has no interactive mode other than through Pig and Hive.
    Spark: Spark has an interactive mode.

  • Hadoop MR: It is used for generating reports that help answer historical queries.
    Spark: It makes it possible to perform streaming, batch processing and machine learning all in the same cluster.

  • Hadoop MR: It does not make maximum use of the memory of the Hadoop cluster.
    Spark: It is reported to execute batch processing jobs about 10 to 100 times faster than Hadoop MR.

  • Hadoop MR: It can process a batch of stored data.
    Spark: It can also process data in near real time through Spark Streaming.

  • Hadoop MR: It performs its operations on disk, writing intermediate results between steps.
    Spark: It achieves lower-latency computations by keeping data in memory as Resilient Distributed Datasets (RDDs).

  • Ease of Coding
    Hadoop MR: Writing Hadoop MR pipelines is a complex and lengthy process.
    Spark: Spark code is generally more compact and easier to write than Hadoop MR code.

Big Data Testing Challenges & Their Solutions

Data Harnessing (Cleansing)
  • Challenge: Testers need to work with both structured and unstructured data, which makes a sampling strategy very difficult.
  • Solution: Perform an in-depth analysis of the structured and unstructured data to convert it into a usable format.
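As a rough illustration of converting unstructured input into a usable format, the sketch below parses raw log lines into structured records and sets aside anything that does not fit. The log layout, the pattern and the sample lines are assumptions, not something a particular tool prescribes.

```python
import re

# Hypothetical log layout: "<YYYY-MM-DD> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)$"
)

def harness(raw_lines):
    """Split raw lines into structured records and rejects that need review."""
    records, rejects = [], []
    for line in raw_lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            records.append(match.groupdict())
        else:
            rejects.append(line)
    return records, rejects

# Illustrative input only.
raw = [
    "2017-03-01 ERROR payment service timed out",
    "free-form comment with no structure",
    "2017-03-02 INFO nightly load finished",
]
records, rejects = harness(raw)
print(len(records), "parsed;", len(rejects), "rejected")
```

Keeping the rejected lines, instead of silently dropping them, gives the tester a concrete sample of the unstructured data that still needs analysis.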
Data Quality & Completeness
  • Challenge: Data is pulled from various sources such as RDBMS, weblogs and social media, so it is difficult to make sure that the complete data has been pulled into the system.
  • Solution: Tools such as Presto, Talend and Datameer can be used to verify the completeness of the data in HDFS.
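Whichever tool is used, a completeness check usually comes down to reconciling what was in the source with what landed in the target. The sketch below shows that idea in plain Python. It does not use the API of any tool named above, and the `id` key, the function names and the sample rows are assumptions.

```python
import hashlib

def row_fingerprint(row):
    """Stable hash of a row's values, independent of dict key order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(source_rows, target_rows, key="id"):
    """Report rows that are missing from, or differ in, the target."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing_in_target": sorted(source.keys() - target.keys()),
        "mismatched": sorted(
            k for k in source.keys() & target.keys() if source[k] != target[k]
        ),
    }

# Illustrative data only: row 3 was never loaded and row 2 was altered.
source_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
target_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
report = reconcile(source_rows, target_rows)
print(report)
```

At Big Data volumes the same comparison would normally be pushed down to the cluster as counts and checksums per partition instead of being run row by row in memory.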
Addressing Data Quality
  • Challenge: The impact of inaccurate or untimely data is more pronounced in the case of Big Data.
  • Solution: Proactively put a data governance or information management process in place to ensure that the data is clean.
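One small, automatable part of such a process is a set of data quality rules that run on every load. The following sketch checks two common rules, missing required fields and duplicate keys. The rule set, field names and sample rows are assumptions chosen for illustration.

```python
def check_quality(rows, required, key):
    """Collect the indexes of rows that break basic data quality rules."""
    issues = {"missing_field": [], "duplicate_key": []}
    seen = set()
    for index, row in enumerate(rows):
        # Rule 1: every required field must be present and non-empty.
        if any(row.get(field) in (None, "") for field in required):
            issues["missing_field"].append(index)
        # Rule 2: the key must not repeat an earlier row's key.
        if row.get(key) in seen:
            issues["duplicate_key"].append(index)
        seen.add(row.get(key))
    return issues

# Illustrative data only.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "b@example.com"},
]
issues = check_quality(rows, required=["id", "email"], key="id")
print(issues)
```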
Displaying Meaningful Results
  • Challenge: Creating BI reports from Big Data becomes difficult when dealing with extremely large and diverse amounts of data.
  • Solution: One way to resolve this is to cluster the data into a higher-level view where smaller groups of data become visible.
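A higher-level view can be as simple as rolling detailed records up into one summary row per group. The sketch below assumes hypothetical `region` and `sales` fields and is only meant to show the shape of such a roll-up.

```python
from collections import defaultdict

def roll_up(events, group_by, measure):
    """Aggregate detailed events into one summary row per group."""
    summary = defaultdict(lambda: {"count": 0, "total": 0})
    for event in events:
        bucket = summary[event[group_by]]
        bucket["count"] += 1
        bucket["total"] += event[measure]
    return dict(summary)

# Illustrative data only.
events = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 80},
    {"region": "east", "sales": 30},
]
view = roll_up(events, group_by="region", measure="sales")
print(view)
```

A report built on the summary rows can be checked against the detail by confirming that the group totals add up to the overall total.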
Test Environment Setup
  • Challenge: Creating an effective test environment with multiple testing nodes for Big Data testing.
  • Solution: Take care that the test environment can handle Big Data effectively and efficiently.
What Data To Track
  • Challenge: Teams struggle to decide what data to track and how to apply what they have learned.
  • Solution: Stick with the data that is most relevant to the business and ignore irrelevant data.
Performance Testing
  • Challenge: Data processing speed, workload and network load balancing must all be tuned to ensure real-time data synchronization.
  • Solution: A good infrastructure is needed to store and process large amounts of data within the given time intervals and meet performance targets.
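Whether data is processed "within the given time intervals" can be checked by measuring throughput against a target. The sketch below times a stand-in workload. The target value, the workload and the function name are assumptions, and real numbers would come from the project's own performance requirements.

```python
import time

def measure_throughput(process, records):
    """Run `process` over all records; return (count, seconds, records/second)."""
    start = time.perf_counter()
    count = 0
    for record in records:
        process(record)
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate

# Hypothetical target and workload, for illustration only.
TARGET_RECORDS_PER_SECOND = 1_000
count, elapsed, rate = measure_throughput(lambda r: r * 2, range(50_000))
print(f"{count} records in {elapsed:.4f}s ({rate:,.0f} records/s)")
print("meets target:", rate >= TARGET_RECORDS_PER_SECOND)
```

On a real cluster the same measurement would be taken per job or per stage, and repeated under different data volumes to see how throughput changes with load.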

Key Takeaways

  • Big Data testing is very different from traditional data testing in terms of data, infrastructure and validation tools
  • The main stages of testing for Big Data applications are data staging validation, MapReduce validation and output validation
  • Widely used tools for Big Data testing include TestingWhiz, QuerySurge and Tricentis
  • Architecture testing is an important phase of Big Data testing, as a poorly designed system may lead to unexpected errors and degraded performance
  • Performance testing for Big Data covers data throughput, data processing and sub-component performance
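The three testing stages named above can be sketched as three small checks. This is a simplified illustration in plain Python: the function names, the sum-per-key business rule and the sample data are all assumptions, not the behavior of any listed tool.

```python
def validate_staging(source_rows, staged_rows):
    """Stage 1: everything pulled from the source reached the staging area."""
    return len(source_rows) == len(staged_rows)

def validate_processing(staged_rows, aggregated):
    """Stage 2: the business logic (here, a sum per key) was applied correctly."""
    expected = {}
    for key, value in staged_rows:
        expected[key] = expected.get(key, 0) + value
    return expected == aggregated

def validate_output(aggregated, loaded):
    """Stage 3: the processed result was loaded into the target unchanged."""
    return aggregated == loaded

# Illustrative data only.
source_rows = [("east", 120), ("west", 80), ("east", 30)]
staged_rows = list(source_rows)              # what landed in staging
aggregated = {"east": 150, "west": 80}       # what the processing job produced
loaded = {"east": 150, "west": 80}           # what the target system holds

results = {
    "staging": validate_staging(source_rows, staged_rows),
    "processing": validate_processing(staged_rows, aggregated),
    "output": validate_output(aggregated, loaded),
}
print(results)
```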
