Hadoop Distributed File System
Edge Node :
Ø Edge nodes are the interface between the Hadoop cluster and the outside network.
Ø For this reason, they are sometimes referred to as gateway nodes.
Ø Most commonly, edge nodes are used to run client applications and cluster administration tools.
Ø An edge node is a machine that is part of the cluster and has the client applications installed.
NameNode :
Ø The NameNode is the centerpiece of an HDFS file system.
Ø It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept.
Ø It does not store the data of these files itself.
Ø Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file.
Ø The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
Ø The NameNode maintains two persistent files: a transaction log called the Edit Log and a namespace image called the FsImage.
Ø The Edit Log records every change that occurs in the file system metadata, such as the creation of a new file.
Ø The Edit Log is stored in the NameNode’s local file system.
Ø The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage.
Ø The FsImage is also stored in the NameNode’s local file system.
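To make the NameNode's bookkeeping concrete, here is a minimal in-memory sketch. It is not Hadoop code: the class `ToyNameNode`, its method names, and the block and DataNode identifiers are all hypothetical, chosen only to illustrate the namespace, the edit log, and a client lookup.

```python
# Toy, in-memory model of NameNode metadata. Not Hadoop code:
# ToyNameNode and its methods are hypothetical names for illustration.

class ToyNameNode:
    def __init__(self):
        self.namespace = {}        # file path -> list of block ids
        self.block_locations = {}  # block id -> set of DataNode names
        self.edit_log = []         # ordered record of namespace changes

    def create_file(self, path, block_ids):
        # Log the change first, then apply it to the in-memory namespace.
        self.edit_log.append(("create", path, tuple(block_ids)))
        self.namespace[path] = list(block_ids)

    def delete_file(self, path):
        self.edit_log.append(("delete", path))
        self.namespace.pop(path, None)

    def report_block(self, datanode, block_id):
        # In this sketch, block locations are learned from DataNodes
        # and held only in memory.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        # Answer a client lookup: which DataNodes hold each block of the file?
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.namespace[path]]


nn = ToyNameNode()
nn.create_file("/logs/app.log", ["blk_1", "blk_2"])
nn.report_block("dn1", "blk_1")
nn.report_block("dn2", "blk_1")
nn.report_block("dn2", "blk_2")
print(nn.locate("/logs/app.log"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

Note that the NameNode in this sketch never touches file contents; it only answers the question of where the blocks live, and the client would then read from the DataNodes it is given.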
Benefits of Secondary NameNode :
Ø The Secondary NameNode performs a process called checkpointing:
Ø 1. The Secondary NameNode periodically fetches the edit logs and the fsimage from the primary NameNode.
Ø 2. It loads both the fsimage and the edit logs into main memory and applies each operation from the edit logs to the fsimage.
Ø 3. It copies the new fsimage back to the primary and records the fsimage’s modification time in the fstime file, so the primary’s fsimage is now up to date.
Ø Because the fsimage is kept up to date, the NameNode has far fewer edit log entries to process when the cluster restarts. The edit log also stays small, since changes are regularly flushed to the fsimage, which helps performance as well.
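The checkpoint steps above can be sketched as a small merge function. This is an illustration only: the function `apply_edits`, the operation names, and the dictionary shape of the image are assumptions made for the example and do not reflect Hadoop's actual on-disk formats.

```python
# Toy checkpoint: fold edit-log entries into a namespace image.
# The function and record shapes are hypothetical, not Hadoop's formats.

def apply_edits(fsimage, edits):
    """Return a new image with every logged operation applied in order."""
    image = dict(fsimage)  # work on a copy, leaving the original untouched
    for op, path, *args in edits:
        if op == "create":
            image[path] = list(args[0])
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[args[0]] = image.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image


# State held by the primary NameNode before the checkpoint.
fsimage = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]

# Steps 1 and 2: fetch both files and merge them in memory.
new_fsimage = apply_edits(fsimage, edits)

# Step 3: the primary adopts the new image; the merged edits are no longer needed.
fsimage, edits = new_fsimage, []

print(fsimage)  # {'/c.txt': ['blk_1']}
```

After the merge, a restart would only need to load the image, since no edits remain to be applied.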
DataNode :
Ø The DataNode is responsible for storing the actual data in HDFS.
Ø The DataNode is also known as the slave.
Ø The NameNode and the DataNodes are in constant communication.
Ø When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
Ø When a DataNode goes down, the availability of the data and of the cluster is not affected, provided the blocks it held are replicated on other DataNodes. The NameNode arranges re-replication of the blocks managed by the unavailable DataNode.
Ø A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNodes.
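The block report and re-replication behaviour described above can be illustrated with a short simulation. Everything here is hypothetical: the function names, the replication target of 2, and the simple "first available node" choice are assumptions for the sketch, whereas real HDFS relies on heartbeats, timeouts, and rack-aware placement.

```python
# Toy simulation of block reports and re-replication. All names are
# hypothetical; real HDFS uses heartbeats, timeouts, and rack awareness.

REPLICATION = 2  # assumed target number of replicas per block

def build_block_map(reports):
    """Invert DataNode block reports into block -> set of DataNodes."""
    block_map = {}
    for datanode, blocks in reports.items():
        for block in blocks:
            block_map.setdefault(block, set()).add(datanode)
    return block_map

def handle_datanode_loss(block_map, live_nodes, dead_node):
    """Drop the dead node and copy under-replicated blocks to live nodes."""
    copies = []
    for block, holders in sorted(block_map.items()):
        holders.discard(dead_node)
        candidates = sorted(n for n in live_nodes if n not in holders)
        # A block with no surviving replica cannot be copied from anywhere.
        while holders and len(holders) < REPLICATION and candidates:
            target = candidates.pop(0)
            holders.add(target)
            copies.append((block, target))
    return copies


# Each DataNode announces the blocks it holds when it starts up.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1", "blk_3"],
    "dn3": ["blk_2", "blk_3"],
}
block_map = build_block_map(reports)

# dn1 goes down: blk_1 and blk_2 each lose one replica.
copies = handle_datanode_loss(block_map, live_nodes={"dn2", "dn3"}, dead_node="dn1")
print(copies)  # [('blk_1', 'dn3'), ('blk_2', 'dn2')]
```

Because every block had a second replica, no data becomes unavailable when `dn1` is lost; the sketch simply restores the target replica count on the surviving nodes.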
Different Hadoop Modes :
1. Local Mode or Standalone Mode
Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, where you don’t really use HDFS.
You can use the local file system for both input and output in standalone mode.
You also don’t need any custom configuration in the files mapred-site.xml, core-site.xml, and hdfs-site.xml.
Standalone mode is usually the fastest of the Hadoop modes, as it uses the local file system for all input and output. Here is a summarized view of standalone mode:
· HDFS is not used.
· No need to change any configuration files.
· The default Hadoop mode is standalone mode.
2. Pseudo-distributed Mode
Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the DataNode reside on the same machine.
In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a configuration is mainly used for testing, when we don’t need to think about resources or about other users sharing them.
In this architecture, a separate JVM is spawned for each Hadoop component, and the components communicate across network sockets, effectively producing a fully functioning mini-cluster on a single host.
Here is a summarized view of pseudo-distributed mode:
· A single-node Hadoop deployment is considered pseudo-distributed mode.
· All the master and slave daemons run on the same node.
· Mainly used for testing purposes.
· The replication factor is one for blocks.
· Changes are required in all three configuration files: mapred-site.xml, core-site.xml, and hdfs-site.xml.
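As a rough illustration of what those three files might contain, the sketch below renders minimal `<configuration>` documents with Python's standard library. The helper `hadoop_site_xml` is hypothetical, and the values are assumptions: `hdfs://localhost:9000` and `yarn` are typical of single-node setups on recent Hadoop versions but may differ for your installation, while `dfs.replication` is set to 1 to match the replication factor mentioned above.

```python
import xml.etree.ElementTree as ET

def hadoop_site_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Assumed values for a single-node setup; adjust host, port, and
# framework name to match your own installation and Hadoop version.
site_files = {
    "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": 1},
    "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
}

for filename, props in site_files.items():
    print(filename)
    print(hadoop_site_xml(props))
```

Generating the files from one dictionary keeps the three settings visible side by side; in practice you would edit the XML files in the Hadoop configuration directory directly.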
3. Fully-Distributed Mode (Multi-Node Cluster)
This is the production mode of Hadoop, where multiple nodes are running. Here data is distributed across several nodes and processing is done on each node.
Master and slave services run on separate nodes in fully-distributed mode.
· Production phase of Hadoop.
· Separate nodes for master and slave daemons.
· Data is distributed across multiple nodes.
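To give a feel for how data ends up spread across nodes, here is a simplified placement sketch. The 128 MB block size and replication factor of 3 are assumed values, the function `place_blocks` is hypothetical, and the round-robin choice of DataNodes is a stand-in for HDFS's real, rack-aware placement policy.

```python
# Toy block placement: split a file into fixed-size blocks and assign
# replicas round-robin. Real HDFS placement is rack-aware; this is only
# an illustration under assumed defaults.

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB)
REPLICATION = 3                 # assumed replication factor

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, [datanodes]) tuples."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placements = []
    num_blocks = -(-file_size // block_size)  # ceiling division
    for i in range(num_blocks):
        length = min(block_size, file_size - i * block_size)
        holders = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placements.append((i, length, holders))
    return placements


nodes = ["dn1", "dn2", "dn3", "dn4"]
for index, length, holders in place_blocks(300 * 1024 * 1024, nodes):
    print(index, length // (1024 * 1024), "MB", holders)
# 0 128 MB ['dn1', 'dn2', 'dn3']
# 1 128 MB ['dn2', 'dn3', 'dn4']
# 2 44 MB ['dn3', 'dn4', 'dn1']
```

A 300 MB file becomes three blocks here, and each block lives on three different nodes, so processing can run on whichever node already holds the block.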
Default configuration files are:
· core-site.xml