Hadoop Distributed File System

- January 25, 2020

Edge Node :

Ø Edge nodes are the interface between the Hadoop cluster and the outside network.

Ø For this reason, they’re sometimes referred to as gateway nodes.

Ø Most commonly, edge nodes are used to run client applications and cluster administration tools.

Ø An edge node is a machine that is part of the cluster and has the client applications installed.

NameNode :

Ø The NameNode is the centerpiece of an HDFS file system.

Ø It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept.

Ø It does not store the data of these files itself.

Ø Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.

Ø The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.

Ø The NameNode maintains two persistent files: a transaction log called the Edit Log and a namespace image called the FsImage.

Ø The Edit Log records every change that occurs in the file system metadata, such as the creation of a new file.

Ø The Edit Log is stored in the NameNode’s local file system.

Ø The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage.

Ø The FsImage is also stored in the NameNode’s local file system.
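
The bookkeeping described above can be sketched in a few lines of Python. This is a toy in-memory model, not Hadoop code: the class `ToyNameNode`, its methods, and the block and DataNode names are all hypothetical, chosen only to illustrate how the namespace, the block locations, and the Edit Log relate to each other.

```python
class ToyNameNode:
    """Toy model of NameNode metadata; hypothetical, not the Hadoop API."""

    def __init__(self):
        self.namespace = {}        # file path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode names
        self.edit_log = []         # append-only record of metadata changes

    def create_file(self, path, blocks):
        """Register a file; `blocks` maps block id -> DataNodes holding it."""
        self.edit_log.append(("create", path, list(blocks)))
        self.namespace[path] = list(blocks)
        self.block_locations.update(blocks)

    def delete_file(self, path):
        """Remove a file and forget where its blocks were stored."""
        self.edit_log.append(("delete", path))
        for block_id in self.namespace.pop(path):
            self.block_locations.pop(block_id, None)

    def locate(self, path):
        """Return (block id, DataNodes) pairs; file contents are never stored here."""
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


nn = ToyNameNode()
nn.create_file("/logs/app.log", {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]})
print(nn.locate("/logs/app.log"))  # where a client should go to read each block
print(nn.edit_log)                 # every metadata change is logged
```

Note that `locate` only returns where the blocks live; in this sketch, as in HDFS, the client would then read the data from the DataNodes themselves.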

Benefits of Secondary NameNode :

Ø The Secondary NameNode performs a process called checkpointing.

Ø 1. The Secondary NameNode periodically fetches the edit logs and the fsimage from the primary NameNode.

Ø 2. The secondary loads both the fsimage and the edit logs into main memory and applies each operation from the edits to the fsimage.

Ø 3. The secondary copies the new fsimage back to the primary and records the fsimage’s modification time in the fstime file, so the primary’s fsimage is now up to date.

Ø Because the fsimage is kept up to date, there is little overhead from replaying edit logs when the cluster restarts. The edit log also stays small, since its changes are regularly merged into the fsimage, which helps performance as well.
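
The three checkpoint steps can be sketched as follows. This is a simplified illustration, assuming the fsimage is a plain dictionary and each edit is a tuple; the function name `checkpoint` and the operation names are hypothetical and do not reflect Hadoop’s on-disk formats.

```python
def checkpoint(fsimage, edits):
    """Apply each logged operation to a copy of the fsimage and return the new image."""
    new_image = dict(fsimage)  # step 2: load the fsimage into memory
    for op, path, *args in edits:
        if op == "create":
            new_image[path] = args[0]
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[args[0]] = new_image.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


# Step 1: the secondary fetches the current fsimage and edit log (toy values).
fsimage = {"/data/a.txt": ["blk_1"]}
edits = [
    ("create", "/data/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/data/a.txt", "/data/c.txt"),
    ("delete", "/data/b.txt"),
]

# Step 3: the merged image replaces the old one and the edit log starts empty again.
fsimage = checkpoint(fsimage, edits)
edits = []
print(fsimage)  # {'/data/c.txt': ['blk_1']}
```

After the merge the edit log is empty, which is why a restart right after a checkpoint has almost nothing to replay.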

DataNode :

Ø DataNode is responsible for storing the actual data in HDFS.

Ø DataNode is also known as the Slave.

Ø NameNode and DataNode are in constant communication.

Ø When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.

Ø When a DataNode goes down, it does not affect the availability of the data or the cluster, as long as replicas of its blocks exist on other DataNodes. The NameNode arranges re-replication of the blocks that were managed by the unavailable DataNode.

Ø A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNodes.
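
The block reports and the re-replication decision described above can be illustrated with a small sketch. It assumes each DataNode’s report is simply a list of block ids; the function `under_replicated` and the node and block names are hypothetical, and real HDFS tracks far more state than this.

```python
def under_replicated(block_reports, live_nodes, replication=3):
    """Return {block id: missing replica count} given reports from live DataNodes."""
    counts = {}
    for node, blocks in block_reports.items():
        for block_id in blocks:
            counts.setdefault(block_id, 0)
            if node in live_nodes:
                counts[block_id] += 1
    return {b: replication - n for b, n in sorted(counts.items()) if n < replication}


# Block reports each DataNode sent when it started up (toy data).
block_reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1", "blk_3"],
    "dn3": ["blk_2", "blk_3"],
}

# dn3 stops responding; every block still has a live replica elsewhere.
print(under_replicated(block_reports, live_nodes={"dn1", "dn2"}, replication=2))
# {'blk_2': 1, 'blk_3': 1}
```

In this toy run the data stays available because each block has at least one live copy, and the result tells the NameNode how many extra replicas to schedule.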

Different Hadoop Modes :

1. Local Mode or Standalone Mode

Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, where you don’t really use HDFS.
In standalone mode, both input and output use the local file system.

You also don’t need to do any custom configuration in the files mapred-site.xml, core-site.xml, and hdfs-site.xml.

Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all input and output. Here is a summary of standalone mode:

· HDFS is not used.

· No need to change any configuration files.

· Default Hadoop Mode is Standalone Mode.

2. Pseudo-distributed Mode

Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the DataNode reside on the same machine.

In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a configuration is mainly used for testing, when we don’t need to think about resources or about other users sharing them.

In this architecture, a separate JVM is spawned for each Hadoop component, and the components communicate across network sockets, effectively producing a fully functioning mini-cluster on a single host.

Here is a summary of pseudo-distributed mode:

· A single-node Hadoop deployment is considered pseudo-distributed mode.

· All the master and slave daemons run on the same node.

· Mainly used for testing purposes.

· The replication factor for blocks is one.

· Changes are required in all three configuration files: mapred-site.xml, core-site.xml, and hdfs-site.xml.
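
To make the configuration changes concrete, here is a small Python sketch that renders a minimal property for each of the three files. The property names and values (`fs.defaultFS`, `dfs.replication`, `mapreduce.framework.name`) are the ones commonly shown for single-node setups and are an assumption here, not something stated in this post; check them against the documentation for your Hadoop version. The helper `render_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Property names and values as commonly shown for single-node setups;
# treat them as assumptions and check the docs for your Hadoop version.
SETTINGS = {
    "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
}


def render_site_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, properties in SETTINGS.items():
    print(filename)
    print(render_site_xml(properties))
```

The `dfs.replication` value of 1 matches the replication factor listed above, since a single node cannot hold replicas on other machines.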

3. Fully-Distributed Mode (Multi-Node Cluster)

This is the production mode of Hadoop, in which multiple nodes are running. Here data will be distributed across several nodes and processing will be done on each node.
