QASource Newsletter

Technocast - Spring 2017

Big Data is a problem statement that is commonly described by the Four V's of Big Data: Volume, Velocity, Variety, and Veracity.

Solutions to Big Data

Big Data can be analyzed for insights that lead to better decisions and strategic business moves. Below are two solutions for analyzing Big Data:

Apache Hadoop

  • Apache Hadoop is a software framework for the distributed processing of large Big Data datasets across large clusters of computers.

    The Hadoop framework has two main components (a minimal example follows this list):

    • HDFS
    • Execution Engine (MapReduce)
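
To make the MapReduce model concrete, here is a minimal word-count job for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read from stdin and write to stdout. The script names and paths below are illustrative assumptions, not something prescribed by Hadoop.

    # mapper.py -- emits "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- Hadoop Streaming sorts mapper output by key before
    # the reducer runs, so equal words arrive on consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The job would then be submitted with the hadoop-streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the jar location and HDFS paths are placeholders).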
Apache Spark

  • Apache Spark is an open-source, fast, and expressive cluster computing engine that is compatible with Apache Hadoop and works with any Hadoop-supported storage system.

    Framework Components:

    • Processing Engine: defines a large set of operations beyond just "map" and "reduce" (see the example after this list)
    • Key Construct: Resilient Distributed Dataset (RDD)
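
As a minimal illustration of the RDD construct, the PySpark word count below chains several of those operations; the input path is an assumed placeholder.

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    # Build an RDD from a text file (the HDFS path is a placeholder)
    # and count word occurrences with a chain of RDD operations.
    counts = (sc.textFile("hdfs:///data/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    # take() pulls a small sample of results back to the driver.
    for word, count in counts.take(10):
        print(word, count)

    sc.stop()

Note how flatMap, map, and reduceByKey go beyond the plain map/reduce pair, and how an RDD can be recomputed from its lineage if a partition is lost.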

Hadoop MapReduce vs Spark

Who Wins the Battle?

Difficulty
  • Hadoop MR: MapReduce is difficult to program and needs an abstraction layer.
  • Spark: Spark is easy to program and needs no additional abstraction.

Interactive Mode
  • Hadoop MR: Hadoop has no interactive mode, except through Pig and Hive.
  • Spark: Spark has an interactive mode.

Use Cases
  • Hadoop MR: Hadoop MR is used for generating reports that help answer historical queries.
  • Spark: Spark makes it possible to perform streaming, batch processing, and machine learning, all in the same cluster.

Performance
  • Hadoop MR: MR does not leverage the memory of the Hadoop cluster to the maximum.
  • Spark: Spark has been reported to execute batch processing jobs roughly 10 to 100 times faster than Hadoop MR.

Streaming
  • Hadoop MR: Hadoop MR can only process batches of stored data.
  • Spark: Spark can process data in real time through Spark Streaming (see the sketch after this table).

Latency
  • Hadoop MR: MapReduce performs all of its operations on disk.
  • Spark: Spark delivers lower-latency computations by keeping Resilient Distributed Datasets (RDDs) in memory.

Ease of Coding
  • Hadoop MR: Writing Hadoop MR pipelines is a complex and lengthy process.
  • Spark: Spark code is more compact and easier to write than Hadoop MR code.
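
To ground the streaming rows above, here is a minimal Spark Streaming sketch using the classic DStream API; it counts words arriving on a TCP socket in one-second micro-batches. The host and port are assumed placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    # Read lines from a socket (host and port are placeholders)
    # and count words within each micro-batch.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()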

Big Data Testing Challenges & Their Solutions

Data Harnessing (Cleansing)
  • Challenge: Testers need to work with both structured and unstructured data, which makes the sampling strategy very difficult.
  • Solution: Perform an in-depth analysis of the structured and unstructured data to convert it into a usable format.

Data Quality & Completeness
  • Challenge: Data is pulled from various sources such as RDBMS, weblogs, and social media, so it is difficult to ensure that the complete data set makes it into the system.
  • Solution: Use tools like Presto, Talend, and Datameer to verify the completeness of the data in HDFS (a minimal example follows this table).

Addressing Data Quality
  • Challenge: The impact of inaccurate or untimely data is more pronounced in the case of Big Data.
  • Solution: Proactively put a data governance or information management process in place to ensure that the data is clean.

Displaying Meaningful Results
  • Challenge: Creating BI reports from Big Data becomes difficult when dealing with extremely large amounts of diverse data.
  • Solution: One way to resolve this is to cluster the data into a higher-level view where smaller groups of data become visible.

Test Environment Setup
  • Challenge: Creating an effective test environment with multiple testing nodes for Big Data testing.
  • Solution: Provision and configure the environment so that it handles Big Data effectively and efficiently.

What Data to Track
  • Challenge: Teams struggle to decide what data to track and how to apply what they have learned.
  • Solution: Stick with the data that is most relevant to the business and ignore irrelevant data.

Performance Testing
  • Challenge: Ensuring fast data processing, managing the workload, and balancing the network load for real-time data synchronization.
  • Solution: Build infrastructure capable of storing and processing large amounts of data within the given time intervals to meet performance targets.
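
As one concrete form such a completeness check can take, the sketch below compares a row count from a source RDBMS table with the row count of its ingested copy in HDFS using PySpark. The JDBC URL, table name, credentials, and HDFS path are all illustrative assumptions, and dedicated tools such as Presto, Talend, or Datameer provide far richer validation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CompletenessCheck").getOrCreate()

    # Row count in the source RDBMS (URL, table, and credentials are
    # placeholders; the matching JDBC driver must be on the classpath).
    source_count = (spark.read.format("jdbc")
                    .option("url", "jdbc:mysql://source-db:3306/sales")
                    .option("dbtable", "orders")
                    .option("user", "qa_user")
                    .option("password", "qa_password")
                    .load()
                    .count())

    # Row count of the ingested copy in HDFS (path is a placeholder).
    hdfs_count = spark.read.parquet("hdfs:///warehouse/orders").count()

    # A full load should preserve the row count exactly.
    if source_count != hdfs_count:
        raise AssertionError("Completeness check failed: %d rows in "
                             "source vs %d in HDFS" % (source_count, hdfs_count))
    print("Completeness check passed: %d rows" % source_count)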

Key Takeaways

  • Big Data testing is very different from traditional data testing in terms of data, infrastructure, and validation tools
  • The main stages of testing for Big Data applications are the data staging validation, MapReduce validation, and output validation phases
  • Widely used tools for Big Data testing include TestingWhiz, QuerySurge, and Tricentis
  • Architecture is an important phase of Big Data testing, as a poorly designed system may lead to unexpected errors and performance degradation
  • Performance testing for Big Data covers data throughput, data processing, and sub-component performance

Have Suggestions?

We would love to hear your feedback, questions, comments, and suggestions. This will help us make the newsletter better and more useful next time.
Share your thoughts and ideas at knowledgecenter@qasource.com

Disclaimer

The logos used in this post are owned by their respective companies or trademark holders. The logos are not authorized by, sponsored by, or associated with the trademark owners; QASource uses them for review purposes only. Endorsement by QASource of the logos used is neither intended nor implied.

Written by QA Experts

The QASource Blog, for executives and engineers, shares QA strategies, methodologies, and new ideas to inform readers and help them effectively deliver quality products, websites, and applications.

Contact Us

Authors

Our bloggers are the test management experts at QASource. They are executives, QA managers, team leads, and testing practitioners. Their combined experience exceeds 100 years and they know how to optimize QA efforts in a variety of industries, domains, tools, and technologies.