This is something everyone should understand, because we are the people behind this big scene of big data. The growth of this field is driven entirely by the tremendous amount of data we generate in every internet second, globally. It comes in the form of text, images, videos, animations, graphs, sensor data, log files and so on.

Let’s say we are good at storing the data. But storage alone is not what the market demands; we need to process it too. On top of that, mind it! the data keeps flowing in, you have to respond to data requests, you need to consider the veracity of the data, and you never know how many inputs and requests will arrive at any instant. After all, you have given this freedom to millions of people with millions of devices in their hands.

This is where the concept of a big data solution comes in. When we have more data than our existing storage and processing units can handle, we can say the data is really big for us. So is this all about data size? No, let’s consider the chart below.

The image below gives an idea of what kind of a headache we give our data scientists every time we use WhatsApp, Facebook, Twitter, YouTube, our mobile or laptop, and you can add any more internet-enabled devices you have with you.

[Image: things that happen in an internet minute]

Wow! It’s really big, isn’t it? And the data volume roughly doubles every year. Obviously, we can’t handle this big data with our conventional infrastructure.

So it’s not only about the volume of data. It’s about the velocity and veracity of data too. We can classify our data into categories: structured (data coming from RDBMS sources), semi-structured (XML and JSON documents) and unstructured (log files, pictures, audio files, communication records, email, Twitter feeds, anything).

So, we should be really thankful to the engineers who develop the systems that take care of all our data and serve it back to us within seconds whenever we need it.

Let’s say, to process this vast amount of data, we enhance our system configuration: we increase our hard disk size, our memory, and so on. This is known as scaling up, or vertical scaling. But what do we do once we reach the level of a supercomputer? We cannot add any more parts to that one system, and so the chance of a single point of failure (SPOF) increases. What do we do if our system goes down?

Let’s say we have arranged a function and invited one thousand people. They come from different countries, different regions, different cultures and different living habits. So along with throwing the party, we have to take care of their accommodation, their transportation, their food habits, their instant needs and so on.

Can you alone do that?

I think yes! If you were the Flash. Don’t know the Flash? Search Google for ‘Flash, the superhero’. Coming back to reality: we need a number of people on our side, each given responsibility for a particular piece of work, and someone in the middle to manage them. That way the work gets divided and finishes sooner. Likewise, to handle big data we need multiple processors working together.

So here comes our hero: the HADOOP FRAMEWORK. This framework is designed with a master-slave architecture. The master node, known as the NameNode, maintains the metadata of the entire system (information about other nodes, directories and files) and manages the blocks present on the DataNodes (slave nodes) attached to it. If the NameNode is down, the system cannot be accessed. The slave nodes, known as DataNodes, are deployed on each machine to provide the actual storage.

These are the nodes responsible for serving read and write requests from clients. At regular intervals, each DataNode sends an update to the NameNode, so that the NameNode stays assured that the DataNodes are working. This is known as a heartbeat (yes Boss, I am alive). If you need more capacity, just add nodes: in this architecture you can add any number of machines without enhancing any single system’s configuration. This is known as scaling out, or horizontal scaling. Another node, known as the Secondary NameNode, should not be assumed to be a backup copy of the NameNode. It is actually responsible for performing periodic checkpoints of the system, which can be used to restart the NameNode in the event of a NameNode failure.
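The heartbeat idea above can be sketched in a few lines of plain Python. This is a toy model for intuition only, not the real Hadoop API: the class and method names here are hypothetical. Each DataNode reports in periodically, and the NameNode considers a node dead once it has been silent longer than a timeout.

```python
import time

# Toy sketch of the NameNode heartbeat bookkeeping (illustrative only).
class ToyNameNode:
    def __init__(self, timeout_seconds=10.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # node_id -> time of last report

    def receive_heartbeat(self, node_id, now=None):
        # "Yes Boss, I am alive": record when this DataNode last checked in.
        self.last_heartbeat[node_id] = now if now is not None else time.time()

    def live_nodes(self, now=None):
        # A node is alive if it reported within the timeout window.
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout]

nn = ToyNameNode(timeout_seconds=10.0)
nn.receive_heartbeat("datanode-1", now=100.0)
nn.receive_heartbeat("datanode-2", now=95.0)
# At time 107, datanode-2 has been silent for 12 s, beyond the 10 s timeout.
print(nn.live_nodes(now=107.0))  # prints ['datanode-1']
```

In the real system the NameNode would also re-replicate the blocks that lived on the dead node, but the bookkeeping above is the core of the heartbeat mechanism.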

So, we are not going to store our data on one server. Rather, a number of DataNodes are networked under the master node, and they keep your data distributed. This is the concept of HDFS (the Hadoop Distributed File System), which is a core part of the Hadoop framework.

Let’s say one node goes down. I am sorry, you lost part of your data? Don’t worry, Hadoop has a way around this. This intelligent buddy keeps 3 replicas of each piece of your data on different nodes. By default, Hadoop breaks your data into blocks of 128 MB and stores them across different nodes.
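Using those defaults (128 MB blocks, replication factor 3), we can do some quick back-of-the-envelope math on what a file costs the cluster. The numbers below are illustrative, not measured from a real deployment.

```python
# HDFS storage math using the defaults mentioned above.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # Number of blocks needed; the last block may be only partially full.
    blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    # Raw storage consumed across the cluster, counting all 3 replicas.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb

blocks, raw = hdfs_footprint(1000)  # a 1000 MB (~1 GB) file
print(blocks)  # 8 blocks (7 full 128 MB blocks + 1 partial)
print(raw)     # 3000 MB of raw cluster storage
```

So a 1 GB file becomes 8 blocks, each stored 3 times, spread over different nodes: losing any single node never loses data.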

Key features of the HADOOP FRAMEWORK

  1. HADOOP is highly scalable (any number of servers can be added as nodes within a cluster for processing).
  2. HADOOP is highly flexible (a machine of any capability can be added to a Hadoop cluster without any difficulty, which offers great economic flexibility).
  3. HADOOP is highly economical (since Hadoop is an open source project, you don’t need to buy licenses to install Hadoop on new systems or to expand your Hadoop clusters).
  4. HADOOP is highly fault tolerant (Hadoop, by design, treats failure as a part of the system and therefore maintains replicas of data across multiple nodes within its cluster, providing a high level of fault tolerance).
  5. HADOOP brings computation closer to data (Hadoop moves computation to the node where the data is stored, making processing much faster. Moving computation is cheaper than moving data, and in this way we avoid the network traffic we would incur by moving the data to the computation).

We are not discussing the types of data here, as Hadoop is capable of handling any type of data. Coming to the processing part: to play around with data we need to code, and for that MapReduce is there, performing distributed data processing.

The components of MapReduce

JobTracker – This is the master of the MapReduce system, which manages the jobs and resources in the cluster (the TaskTrackers). The JobTracker’s job is to schedule each map task as close as possible to the actual data being processed.

TaskTrackers – These are the slaves of the JobTracker, deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker. Usually, the JobTracker assigns map and reduce tasks to TaskTrackers that are local to the node where the data being processed resides.

JobHistoryServer – This is a daemon that serves historical information about completed applications. Typically, the JobHistoryServer is co-deployed with the JobTracker.

For batch processing (historical data processing), MapReduce is fine. But when it comes to streaming data, MapReduce won’t help. And MapReduce is also not very user-friendly.
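To see what the map and reduce phases actually do, here is a tiny pure-Python sketch of the classic word-count job. No Hadoop cluster is involved; this just mimics the map, shuffle and reduce phases that the framework runs across many TaskTrackers.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data keeps growing"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In real Hadoop, the map and reduce functions run in parallel on nodes holding the data, and the shuffle happens over the network, but the logical flow is exactly this.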

Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services, and telemetry from connected devices or instrumentation in data centres.

There are requirements in the real world where we need the results in seconds, e.g., fraud detection: suppose someone is trying to tamper with your account; you should immediately get a notification. Likewise, a number of jobs need to process data immediately after it is generated and deliver the result with latency on the order of seconds or milliseconds.

Concluding the Article

Streaming data processing requires two layers: a storage layer and a processing layer. The storage layer needs to support record ordering and strong consistency to enable fast, inexpensive and replayable reads and writes of large streams of data. The processing layer is responsible for consuming data from the storage layer, running computations on that data, and then notifying the storage layer to delete data that is no longer needed.
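The two-layer pattern above can be sketched as a toy model: a storage layer that buffers an ordered, replayable log of records, and a processing layer that consumes them and then tells the storage layer to trim what it no longer needs. This is an illustration of the pattern, not the API of any specific platform; all names are hypothetical.

```python
from collections import deque

class StorageLayer:
    def __init__(self):
        self.log = deque()  # ordered, replayable record log

    def append(self, record):
        self.log.append(record)

    def read_all(self):
        return list(self.log)  # reads are replayable: the log is untouched

    def trim(self, n):
        # Delete the n oldest records once processing no longer needs them.
        for _ in range(min(n, len(self.log))):
            self.log.popleft()

class ProcessingLayer:
    def __init__(self, storage):
        self.storage = storage
        self.total = 0

    def run(self):
        records = self.storage.read_all()
        self.total += sum(records)       # the "computation" on the stream
        self.storage.trim(len(records))  # notify storage to delete them

storage = StorageLayer()
for value in [3, 5, 7]:
    storage.append(value)

proc = ProcessingLayer(storage)
proc.run()
print(proc.total)        # 15
print(len(storage.log))  # 0 (processed records trimmed)
```

Real platforms add partitioning, replication and checkpointing on top of this, which is where the durability and fault-tolerance planning mentioned below comes in.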

You also have to plan for scalability, data durability, and fault tolerance in both the storage and processing layers. As a result, many platforms have emerged that provide the infrastructure needed to build streaming data applications including Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Apache Flume, Apache Spark Streaming, and Apache Storm.