Big Data is a problem statement that can be described in the image below:
Solution to Big Data
Big Data can be analyzed for insights that lead to better decisions and strategic business moves. Two widely used frameworks for analyzing Big Data are described below:
Apache Hadoop
- Apache Hadoop is a software framework for the distributed processing of large datasets across large clusters of computers.
The Hadoop framework has two main components:
- HDFS (Hadoop Distributed File System)
- Execution Engine (MapReduce)
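To make the MapReduce execution model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python, not Hadoop code: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count`, and the sample lines, are illustrative assumptions. In a real cluster the framework distributes these phases across nodes and reads the input from HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for files stored in HDFS.
sample_lines = [
    "big data needs big clusters",
    "spark and hadoop process big data",
]
counts = word_count(sample_lines)
print(counts["big"], counts["data"])
```

The programmer only supplies the map and reduce functions; grouping by key is the framework's job, which is why the sketch keeps the shuffle step separate.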
Apache Spark
- Apache Spark is an open-source engine: a fast, expressive cluster computing system that is compatible with Apache Hadoop and works with any Hadoop-supported storage system.
Spark framework components:
- Processing Engine: instead of just “map” and “reduce”, Spark defines a large set of operations.
- Key Construct: Resilient Distributed Dataset (RDD)
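The idea behind the RDD can be illustrated with a toy model: transformations only record what should be done, and nothing is computed until an action asks for a result. The class below, `MiniRDD`, is a hypothetical stand-in written in plain Python and is not Spark's API, although its method names (`map`, `filter`, `collect`, `count`) deliberately mirror operations that exist on real RDDs. It runs on a single machine and ignores partitioning entirely.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records a lineage of steps and runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input data
        self._lineage = lineage  # ordered tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new MiniRDD and computes nothing yet.
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also lazy, it only extends the lineage.
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        # Action: number of records after all transformations.
        return len(self.collect())


numbers = MiniRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
print(result)
```

Because each `MiniRDD` keeps its source and its lineage, a result can always be recomputed from scratch. That recorded lineage is a simplified picture of what the word "resilient" refers to.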
Hadoop MapReduce vs Spark
Who Wins the Battle?
Hadoop MapReduce (MR) | Aspect | Spark |
---|---|---|
MapReduce is difficult to program and needs an abstraction layer | Difficulty | Spark is easier to program and needs no additional abstraction layer |
Hadoop MR has no interactive mode of its own; interactive use relies on tools such as Pig and Hive | Interactive Mode | Spark has an interactive mode |
Hadoop MR is used for generating reports that help answer historical queries | Streaming | Spark makes it possible to perform streaming, batch processing and machine learning in the same cluster |
MR does not make full use of the memory available in the Hadoop cluster | Performance | Spark is reported to run batch processing jobs roughly 10 to 100 times faster than Hadoop MR |
Hadoop MR can process a batch of stored data | Streaming | Spark can process data in near real time through Spark Streaming |
MapReduce writes intermediate results to disk between stages | Latency | Spark achieves lower-latency computation by using the Resilient Distributed Dataset (RDD), which can be kept in memory |
Writing Hadoop MR pipelines is a complex and lengthy process | Ease of Coding | Spark code is generally more compact and easier to write than Hadoop MR code |
Big Data Testing Challenges & Their Solutions
Challenges | Solutions |
---|---|
Testers must work with both structured and unstructured data, which makes choosing a sampling strategy very difficult. | Perform in-depth analysis of the structured and unstructured data to convert it into a usable format. |
Data is pulled from various sources such as RDBMS, weblogs and social media, so it is difficult to confirm that the complete data set has reached the system. | Use tools such as Presto, Talend and Datameer to verify the completeness of the data in HDFS. |
The impact of inaccurate or untimely data is more pronounced in Big Data. | Proactively put a data governance or information management process in place to ensure that data is clean. |
Creating BI reports from Big Data becomes difficult when the data is extremely large and diverse. | One way to resolve this is to cluster data into a higher-level view where smaller groups of data become visible. |
Creating an effective test environment with multiple testing nodes for Big Data testing. | Set up and maintain the test environment so that it can handle Big Data effectively and efficiently. |
Teams struggle to decide what data to track and how to apply what they have learned. | Focus on the data that is most relevant to the business and ignore irrelevant data. |
Faster data processing, workload balancing and network load balancing are needed to ensure real-time data synchronization. | Provide infrastructure that can store and process large amounts of data within the given time intervals to meet performance targets. |
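The completeness challenge, confirming that everything pulled from the source systems actually landed in the target, can be sketched as a reconciliation of record counts and per-record fingerprints. This is a minimal in-memory illustration, not how Presto, Talend or Datameer work internally. The helper names `record_fingerprint` and `completeness_report`, and the sample rows, are hypothetical; at real volumes the same comparison would run as distributed queries against the source and HDFS.

```python
import hashlib
from collections import Counter


def record_fingerprint(record):
    """Return a stable SHA-256 hash of one record (a tuple of field values)."""
    joined = "\x1f".join(str(field) for field in record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def completeness_report(source_records, target_records):
    """Compare source and target by fingerprint, keeping duplicate records distinct."""
    source = Counter(record_fingerprint(r) for r in source_records)
    target = Counter(record_fingerprint(r) for r in target_records)
    missing = source - target      # present in source, absent from target
    unexpected = target - source   # present in target, absent from source
    return {
        "source_count": sum(source.values()),
        "target_count": sum(target.values()),
        "missing_in_target": sum(missing.values()),
        "unexpected_in_target": sum(unexpected.values()),
        "complete": not missing and not unexpected,
    }


# Hypothetical sample rows: one source record never reached the target.
source_rows = [(1, "alice", "NY"), (2, "bob", "LA"), (3, "carol", "SF")]
target_rows = [(1, "alice", "NY"), (3, "carol", "SF")]
report = completeness_report(source_rows, target_rows)
print(report)
```

Comparing fingerprints as well as counts catches the case where the totals match but individual records were altered or swapped during loading.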
Key Takeaways
- Big Data testing differs substantially from traditional data testing in terms of data, infrastructure and validation tools.
- The main stages of testing for Big Data applications are data staging validation, MapReduce validation and output validation.
- Widely used tools for Big Data testing include TestingWhiz, QuerySurge and Tricentis.
- Architecture testing is an important phase of Big Data testing, as a poorly designed system may lead to unexpected errors and degraded performance.
- Performance testing for Big Data covers data throughput, data processing and sub-component performance.
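As a rough illustration of the data throughput measurement mentioned in the last point, the sketch below times a processing function over batches of records and reports records per second. The names `measure_throughput` and `uppercase_batch` are assumptions made for this example, and the workload is a trivial placeholder. A real performance test would exercise the actual pipeline on a representative cluster and data volume.

```python
import time


def measure_throughput(process_batch, records, batch_size=1000):
    """Run process_batch over records in batches and report records per second."""
    processed = 0
    start = time.perf_counter()
    for offset in range(0, len(records), batch_size):
        batch = records[offset:offset + batch_size]
        process_batch(batch)
        processed += len(batch)
    elapsed = time.perf_counter() - start
    return {
        "records": processed,
        "seconds": elapsed,
        # Guard against a zero reading on very coarse timers.
        "records_per_second": processed / elapsed if elapsed > 0 else float("inf"),
    }


def uppercase_batch(batch):
    """Placeholder workload standing in for a real processing step."""
    return [value.upper() for value in batch]


stats = measure_throughput(uppercase_batch, ["row"] * 5000, batch_size=500)
print(stats["records"], round(stats["records_per_second"]))
```

Running the same measurement against each sub-component in isolation gives a simple way to see which stage limits overall throughput.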