Big Data Testing Challenges and Solutions: TechnoCast - Spring 2017

QASource | June 28, 2017

Big Data is a problem statement that is commonly summarized by the Four V's of Big Data, usually listed as volume, velocity, variety and veracity.

Solution to Big Data

Big Data can be analyzed for insights that lead to better decisions and strategic business moves. Two widely used frameworks for analyzing Big Data are described below:

Apache Hadoop
  • Apache Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

    The Hadoop framework has two main components:

    • HDFS (Hadoop Distributed File System), the storage layer
    • Execution Engine (MapReduce), the processing layer
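
To make the MapReduce execution model concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are illustrative assumptions, not Hadoop APIs. On a real cluster, the framework would run the map and reduce functions on many nodes over data stored in HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data testing", "big data tools"]
    print(word_count(sample))
```

A tester can use a small reference implementation like this to compute expected results for a sample input and compare them with what the cluster job produces.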
Apache Spark
  • Apache Spark is an open source, fast and expressive cluster computing engine. It is compatible with Apache Hadoop and works with any Hadoop-supported storage system.

    Framework Components:

    • Processing Engine: instead of just “map” and “reduce”, it defines a large set of operations (transformations and actions)
    • Key Construct: Resilient Distributed Dataset (RDD)
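
The idea behind an RDD can be illustrated with a toy, single-machine class that records chained operations and only computes when a result is requested. This is a hedged sketch in plain Python: `ToyRDD` is a hypothetical name and is not Spark's API, although its method names (`map`, `filter`, `collect`, `reduce`) mirror operations that Spark RDDs also provide. The recorded list of steps stands in for the lineage that lets a real RDD be recomputed after a failure.

```python
import functools
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD; not Spark's API."""

    def __init__(self, source: Iterable[Any], lineage: Tuple[Step, ...] = ()) -> None:
        self._source = list(source)
        self._lineage = lineage  # recorded steps, replayed only when needed

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Transformation: record the step and return a new dataset, no work yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyRDD":
        # Transformation: also lazy.
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self) -> Iterator[Any]:
        # Replay the recorded lineage over the source data.
        items: Iterator[Any] = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self) -> List[Any]:
        # Action: triggers the computation and returns all results.
        return list(self._compute())

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        # Action: triggers the computation and folds results into one value.
        return functools.reduce(fn, self._compute())


if __name__ == "__main__":
    odd_squares = ToyRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
    print(odd_squares.collect())
    print(odd_squares.reduce(lambda a, b: a + b))
```

Nothing is computed when `map` and `filter` are called; the work happens only in `collect` or `reduce`, which is the same split between transformations and actions that Spark uses.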

Hadoop MapReduce vs Spark

Who Wins the Battle?

Hadoop MapReduce (MR) compared with Spark, point by point:

  • Hadoop MR: MapReduce is difficult to program directly and needs higher-level abstractions.
    Spark: Spark is easy to program and does not need an additional abstraction layer.

  • Interactive Mode
    Hadoop MR: Hadoop has no interactive mode apart from Pig and Hive.
    Spark: Spark has an interactive mode.

  • Hadoop MR: Hadoop MR is used for generating reports that help answer historical queries.
    Spark: Spark makes it possible to perform streaming, batch processing and machine learning all in the same cluster.

  • Hadoop MR: MR does not make full use of the memory available in the Hadoop cluster.
    Spark: Spark is reported to execute batch processing jobs roughly 10 to 100 times faster than Hadoop MR.

  • Hadoop MR: Hadoop MR processes a batch of stored data.
    Spark: Spark can process data in near real time through Spark Streaming.

  • Hadoop MR: MapReduce reads from and writes to disk at each stage of a job.
    Spark: Spark achieves lower-latency computation by keeping Resilient Distributed Datasets (RDDs) in memory where possible.

  • Ease of Coding
    Hadoop MR: Writing Hadoop MR pipelines is a complex and lengthy process.
    Spark: Spark code is generally more compact and easier to write than the equivalent Hadoop MR code.

Big Data Testing Challenges & Their Solutions




Data Harnessing (Cleansing)

Challenge: Testers need to work with both structured and unstructured data, which makes it very difficult to define a sampling strategy.

Solution: Perform in-depth analysis of the structured and unstructured data to convert it into a usable format.

Data Quality & Completeness

Challenge: Data is pulled from various sources such as RDBMS, weblogs and social media, so it is difficult to make sure that the complete data set has been pulled into the system.

Solution: Tools such as Presto, Talend and Datameer can be used to verify the completeness of the data in HDFS.
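
Independently of the tool chosen, a completeness check usually comes down to reconciling the source extract against what was loaded. The following is a minimal sketch of that idea in plain Python, assuming both sides fit in memory as lists of dictionaries and share a unique key column; `row_fingerprint` and `reconcile` are hypothetical helper names, not functions of Presto, Talend or Datameer. At real Big Data volumes the same comparison would be pushed down into queries that run on the cluster.

```python
import hashlib
from typing import Dict, Iterable, List, Mapping


def row_fingerprint(row: Mapping[str, object], columns: List[str]) -> str:
    """Hash the compared columns in a fixed order so rows can be matched cheaply."""
    canonical = "|".join(str(row.get(column, "")) for column in columns)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(
    source_rows: Iterable[Mapping[str, object]],
    target_rows: Iterable[Mapping[str, object]],
    key: str,
    columns: List[str],
) -> Dict[str, object]:
    """Compare source and target by key; assumes the key is unique on each side."""
    source = {row[key]: row_fingerprint(row, columns) for row in source_rows}
    target = {row[key]: row_fingerprint(row, columns) for row in target_rows}
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing_in_target": sorted(set(source) - set(target)),
        "unexpected_in_target": sorted(set(target) - set(source)),
        "mismatched": sorted(k for k in source.keys() & target.keys() if source[k] != target[k]),
    }


if __name__ == "__main__":
    source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
    loaded = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}]
    print(reconcile(source, loaded, key="id", columns=["id", "name"]))
```

Row counts catch dropped records, while the per-row fingerprints catch records that arrived but were altered on the way in.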

Addressing Data Quality

Challenge: The impact of inaccurate or untimely data is more pronounced in the case of Big Data.

Solution: Put a data governance or information management process in place and use it proactively to ensure that the data is clean.
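
One way such a process can be made testable is to express each quality expectation as a named rule and audit records against all of them. The sketch below is a plain Python illustration under stated assumptions: the field names `user_id`, `amount` and `event_time`, the rule names, and the helpers `build_rules` and `audit` are all hypothetical. It covers both inaccurate data (missing or out-of-range values) and untimely data (records older than an allowed age).

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Mapping, Sequence

Rule = Callable[[Mapping[str, object]], bool]


def build_rules(now: datetime, max_age: timedelta) -> Dict[str, Rule]:
    """Return named checks; each returns True when a record is acceptable."""
    return {
        "user_id_present": lambda r: r.get("user_id") not in (None, ""),
        "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
        and r["amount"] >= 0,
        "event_is_fresh": lambda r: isinstance(r.get("event_time"), datetime)
        and now - r["event_time"] <= max_age,
    }


def audit(records: Sequence[Mapping[str, object]], rules: Dict[str, Rule]) -> Dict[str, List[int]]:
    """Map each rule name to the indexes of the records that violate it."""
    failures: Dict[str, List[int]] = {name: [] for name in rules}
    for index, record in enumerate(records):
        for name, rule in rules.items():
            if not rule(record):
                failures[name].append(index)
    return failures


if __name__ == "__main__":
    now = datetime(2017, 6, 28, 12, 0)
    records = [
        {"user_id": "u1", "amount": 10.0, "event_time": now - timedelta(hours=1)},
        {"user_id": "", "amount": -5, "event_time": now - timedelta(days=3)},
        {"user_id": "u3", "amount": 7, "event_time": now - timedelta(hours=2)},
    ]
    print(audit(records, build_rules(now, max_age=timedelta(days=1))))
```

Keeping the rules in one place makes them easy to review with data owners and to rerun on every load.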

Displaying Meaningful Results

Challenge: Creating BI reports from Big Data becomes difficult when dealing with extremely large and diverse data sets.

Solution: One way to resolve this is to cluster the data into a higher-level view in which smaller groups of data become visible.
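
A simple form of this higher-level view is a roll-up that groups detailed records by one attribute and summarizes a measure per group. The sketch below assumes in-memory records with a grouping field and a numeric measure; `roll_up` and the field names `region` and `revenue` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def roll_up(
    events: Iterable[Mapping[str, object]], group_by: str, measure: str
) -> Dict[object, Dict[str, float]]:
    """Summarize detailed events into one count/total/average row per group."""
    summary: Dict[object, Dict[str, float]] = defaultdict(lambda: {"count": 0, "total": 0.0})
    for event in events:
        bucket = summary[event[group_by]]
        bucket["count"] += 1
        bucket["total"] += float(event[measure])
    return {
        group: {
            "count": stats["count"],
            "total": stats["total"],
            "average": stats["total"] / stats["count"],
        }
        for group, stats in summary.items()
    }


if __name__ == "__main__":
    events = [
        {"region": "east", "revenue": 10.0},
        {"region": "east", "revenue": 30.0},
        {"region": "west", "revenue": 5.0},
    ]
    print(roll_up(events, group_by="region", measure="revenue"))
```

For report testing, the rolled-up totals can be checked against the sum over the detailed records so that the summary view is known to lose no data.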

Test Environment Setup

Challenge: Creating an effective test environment with multiple testing nodes for Big Data testing is difficult.

Solution: Plan and maintain the test environment carefully so that it can handle Big Data effectively and efficiently.

What Data To Track

Challenge: Teams struggle to decide what data to track and how to apply what they have learned.

Solution: Focus on the data that is most relevant to the business and ignore irrelevant data.

Performance Testing

Challenge: Data must be processed faster, and workload and network load must be balanced, to ensure real-time data synchronization.

Solution: A good infrastructure is needed to store and process large amounts of data within the given time intervals and meet performance targets.
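
Whether the infrastructure meets its targets can be checked by measuring throughput over a known workload. The following is a minimal sketch, assuming the work can be driven batch by batch from a test harness; `measure_throughput`, `meets_target` and the injectable `clock` parameter are hypothetical names introduced here so that the timing logic itself can be tested deterministically.

```python
import time
from typing import Callable, Dict, Iterable, Sequence


def measure_throughput(
    batches: Iterable[Sequence[object]],
    process: Callable[[Sequence[object]], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Process every batch and report records handled per second."""
    records = 0
    start = clock()
    for batch in batches:
        process(batch)
        records += len(batch)
    elapsed = clock() - start
    rate = records / elapsed if elapsed > 0 else float("inf")
    return {"records": records, "seconds": elapsed, "records_per_second": rate}


def meets_target(result: Dict[str, float], min_records_per_second: float) -> bool:
    """Compare the measured rate with the agreed performance target."""
    return result["records_per_second"] >= min_records_per_second


if __name__ == "__main__":
    workload = [list(range(1000)) for _ in range(50)]
    result = measure_throughput(workload, process=lambda batch: sum(batch))
    print(result, meets_target(result, min_records_per_second=1000.0))
```

In a real performance test the `process` callable would submit work to the cluster, and the same measurement would be repeated under different data volumes and node counts.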


Key Takeaways
  • Big Data testing is very different from traditional data testing in terms of data, infrastructure and validation tools
  • The main stages of testing for Big Data applications are data staging validation, MapReduce validation and output validation
  • Widely used tools for Big Data testing include TestingWhiz, QuerySurge and Tricentis
  • Architecture testing is an important phase of Big Data testing, as a poorly designed system may lead to unexpected errors and degraded performance
  • Performance testing for Big Data includes data throughput, data processing and sub-component performance
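
The three testing stages named in the takeaways can be sketched as three small checks that run in order. This is an illustrative outline in plain Python, assuming a word-count job and data small enough to hold in memory; the function names and the shape of the inputs are hypothetical and not taken from any of the tools listed.

```python
from collections import Counter
from typing import Dict, List, Tuple


def validate_staging(source_records: List[str], staged_records: List[str]) -> bool:
    """Stage 1: every record extracted from the source landed in staging."""
    return Counter(source_records) == Counter(staged_records)


def validate_processing(staged_records: List[str], job_output: Dict[str, int]) -> bool:
    """Stage 2: the job's aggregates match an independent recomputation."""
    expected = Counter(word for record in staged_records for word in record.split())
    return dict(expected) == job_output


def validate_output(job_output: Dict[str, int], target_rows: List[Tuple[str, int]]) -> bool:
    """Stage 3: the rows loaded into the target system match the job output."""
    return sorted(job_output.items()) == sorted(target_rows)


def run_all(
    source_records: List[str],
    staged_records: List[str],
    job_output: Dict[str, int],
    target_rows: List[Tuple[str, int]],
) -> Dict[str, bool]:
    """Run the three stages and report which ones passed."""
    return {
        "staging": validate_staging(source_records, staged_records),
        "processing": validate_processing(staged_records, job_output),
        "output": validate_output(job_output, target_rows),
    }


if __name__ == "__main__":
    source = ["a b", "b c"]
    output = {"a": 1, "b": 2, "c": 1}
    target = [("a", 1), ("b", 2), ("c", 1)]
    print(run_all(source, list(source), output, target))
```

Reporting each stage separately shows where a defect was introduced: during ingestion, in the processing logic, or while loading the results.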
Have Suggestions?

We would love to hear your feedback, questions, comments and suggestions. They will help us make the next edition better and more useful.
Share your thoughts and ideas at

