Thursday, August 11, 2016

Big Data and Hadoop Overview





What is Big Data?
It is a broad term for data sets that are so large or complex that traditional data processing applications are insufficient to handle them.

Why do we need to know about Big Data?
An increasing number of data sources such as social media and a growing number of media-rich data types such as videos are fueling the challenges in data analysis, capture, search, sharing, storage, transfer, visualization, and information privacy.

What is Apache Hadoop?
It is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. 
Some known features:

  • Scalable: It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. 
  • High-Availability and Robust: Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer.

What are the Hadoop components?
Hadoop has four major components, which work together to provide better services:
  1. Hadoop Common: Common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): Provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (a minimal word-count sketch follows the note below).
Note:
  • Node: Typically a computer.  
  • Rack: A collection of multiple nodes that are all connected to the same network switch.
  • Cluster: A collection of multiple racks. 
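
To make the MapReduce component concrete, here is a minimal word-count sketch using the YARN-based MapReduce Java API. The class name and the input/output paths (passed as program arguments) are placeholders chosen for this example, not something defined in this post.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}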

Architecture of Apache Hadoop
There are two major types of nodes in a Hadoop cluster:


  1. HDFS Nodes
    • NameNode (one per cluster; manages the file-system namespace and its metadata)
    • DataNode (many per cluster; stores data blocks, serves them to clients, and periodically reports block information to the NameNode; see the block-location sketch after this list)

  2. MapReduce Nodes
    • JobTracker (one per cluster; receives job requests from clients and schedules and monitors MapReduce jobs on the TaskTrackers)
    • TaskTracker (many per cluster; executes the MapReduce tasks assigned by the JobTracker)
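
To illustrate the split between metadata (held by the NameNode) and data (stored on the DataNodes), the hypothetical sketch below asks HDFS which hosts hold each block of a file. The class name and the file path argument are assumptions made for this example; the cluster address is taken from core-site.xml (fs.defaultFS).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode
    Path file = new Path(args[0]);              // e.g. a file path passed on the command line

    FileStatus status = fs.getFileStatus(file);
    // The NameNode only holds metadata; the block data itself lives on the DataNodes.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}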


Overview of Hadoop Cluster




Writing File to HDFS (Hadoop Distributed File System)

Note: File block size: 64 MB (default), 128 MB (recommended). A larger block size means fewer blocks per file, so less time is spent on disk seeks relative to data transfer, which directly improves the performance of Hadoop. 
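As a small, hypothetical sketch of raising the block size from a client (assuming the Hadoop 2.x property name "dfs.blocksize"; older releases used "dfs.block.size", the cluster-wide default is normally set in hdfs-site.xml, and the class name here is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Only affects files created through this client configuration.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default block size: " + fs.getDefaultBlockSize());
    fs.close();
  }
}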

  1. The client submits a create request to the NameNode. The NameNode checks whether the file already exists and verifies that the client has write permission.
  2. The NameNode then determines the DataNode to which the first block of the file will be written. (Note: if the client is itself running on a DataNode, the first block is written to that same DataNode; otherwise a DataNode is picked at random.)
  3. The same block of data is then replicated to at least two other places in the cluster, which may reside in the same rack (as shown above). Again, the DataNodes are picked at random.
  4. To ensure that the data block was written successfully to the DataNodes, an acknowledgement is sent back from the last node to the client in reverse order along the pipeline.
  5. Once the client receives the acknowledgement, the same process is repeated for the remaining blocks.
  6. When the client has finished writing all of the data blocks to the DataNodes and has received the acknowledgements, it tells the NameNode that the write is "completed".
  7. The NameNode then checks each data block for minimal replication before responding. (The sketch below shows this flow from the client's side.)
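
A minimal, hypothetical sketch of the same flow using the HDFS Java client API; the file path and class name are placeholders chosen for this example:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);            // client handle to HDFS
    Path file = new Path("/user/demo/hello.txt");    // placeholder path

    // create() first contacts the NameNode (steps 1-2); the returned stream
    // then writes blocks to the DataNode pipeline (steps 3-4).
    try (FSDataOutputStream out = fs.create(file, /* overwrite = */ false)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    } // close() waits for acknowledgements and notifies the NameNode (steps 5-7)

    System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}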
