MapReduce

Introduction to MapReduce:

·       MapReduce is a computing model that decomposes a large data-processing job into individual tasks.

·       These tasks can be executed in parallel across the cluster.

·       The results of the tasks are joined together to form the final result.

·       MapReduce is the data processing component of Hadoop.

·       MapReduce transforms a list of input data elements into a list of output data elements.

·       MapReduce is the heart of Hadoop. It is designed for processing huge amounts of data.

·       There are two different processing layers:

1.    Map

2.    Reduce
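To make the two layers concrete, here is a minimal word-count sketch in plain Java. It only simulates the flow (map, group by key, reduce) inside a single JVM and does not use any Hadoop classes; the class name WordCountSketch and its method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the MapReduce flow; no Hadoop classes are used.
class WordCountSketch {

    // Map step: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: one key and all of its values in, one total out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (TreeMap keeps keys sorted),
    // then call reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=2, node=1}
        System.out.println(run(List.of("big data big cluster", "data node")));
    }
}
```

In a real Hadoop job the map and reduce calls run as separate tasks on different nodes; here they are ordinary method calls so that the data flow is easy to follow.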

Different Phases in MapReduce:

Map:

·       Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs).

·       The input data can be in structured or unstructured format.

·       The key is a reference to the input value (for example IntWritable or LongWritable).

·       The value is the data on which to operate (for example IntWritable, LongWritable or Text).

·       The output of map is known as intermediate output. It is stored on the local file system of the node that runs the mapper, not on HDFS.

·       The intermediate output of map is given as input to the reduce phase.

·       If the job has no reduce phase, the map output is written directly to HDFS. Otherwise, the output is stored on HDFS once the reduce phase has completed.

·       The movement of output from the mapper phase to the reducer phase is known as shuffling.

·       The output pair of a mapper can be of a different type from its input pair.

·       Different phases under map phase:

Ø Partitioner: The output of the mapper is divided into partitions by the partitioner, one partition per reducer, so that all records with the same key go to the same reducer.

Ø Combiner: Before the output is passed to the reduce phase, the combiner summarizes the output records that share the same key. The combiner is therefore known as a “Mini-Reducer”.
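The partitioner and combiner steps can be sketched in plain Java as follows. This is an illustration only: MapSideSketch, partitionFor and combine are hypothetical names, and the hash-modulo rule is written to mirror the idea behind Hadoop's default HashPartitioner rather than to reproduce the framework code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the two map-side steps; no Hadoop classes are used.
class MapSideSketch {

    // Partitioner: choose a partition (reducer index) for a key.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Combiner: sum the values of records that share a key within one mapper's output,
    // so fewer records have to travel to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput =
                List.of(Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1));
        Map<String, Integer> combined = combine(mapOutput);
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue()
                    + " -> partition " + partitionFor(e.getKey(), 2));
        }
    }
}
```

Because the partition depends only on the key, every record for a given key ends up at the same reducer, whichever mapper produced it.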

Reduce:

·       The input of the reducer is the intermediate output produced by the mapper.

·       The key/value pairs provided to the reducer are sorted by key.

·       Reduce is the second phase of MapReduce.

·       The output of the reduce phase is the final output.

·       Different operations, such as aggregation and filtering, can be performed in the reduce phase.

·       By default, the number of reducers is 1.

·       There are 3 phases of the reducer in MapReduce:

Ø Shuffling: The process of transferring output from the mappers to the reducers is known as shuffling.

Ø Sorting: The keys generated by the mappers are automatically sorted by MapReduce. The input therefore reaches the reducer sorted by key, which helps the reducer easily distinguish when a new reduce call (for the next key) should start.

Ø Reduce phase: The final output is produced after the sorting and aggregation operations are performed.

·       We can set the number of reducers by using the following method:

Job.setNumReduceTasks(int)

Increasing the number of reducers:

Ø Increases the framework overhead.

Ø Improves load balancing.

Ø Lowers the cost of failures.
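The sorting step of the reducer described above can be illustrated with a small plain-Java sketch. It assumes a sum as the aggregate operation, and ReduceSideSketch and sortAndReduce are hypothetical names. The point is that, once the pairs are sorted by key, a change of key is enough to tell where one group ends and the next begins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of sort-then-reduce on one reducer; no Hadoop classes are used.
class ReduceSideSketch {

    static List<String> sortAndReduce(List<Map.Entry<String, Integer>> shuffled) {
        // Sorting: order the shuffled pairs by key.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(shuffled);
        sorted.sort(Map.Entry.comparingByKey());

        // Reduce: walk the sorted pairs; a key change marks the end of a group.
        List<String> output = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (Map.Entry<String, Integer> pair : sorted) {
            if (currentKey != null && !pair.getKey().equals(currentKey)) {
                output.add(currentKey + "\t" + sum); // emit the finished group
                sum = 0;
            }
            currentKey = pair.getKey();
            sum += pair.getValue();
        }
        if (currentKey != null) {
            output.add(currentKey + "\t" + sum); // emit the last group
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> shuffled =
                List.of(Map.entry("b", 1), Map.entry("a", 2), Map.entry("b", 3));
        sortAndReduce(shuffled).forEach(System.out::println);
    }
}
```

Without the sort, the reducer would have to keep every key in memory until all input had been read, because a key could reappear at any time.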

 

Data Types:

Java Data Type                     MapReduce Data Type

1.     int                        :         IntWritable

2.     float                      :         FloatWritable

3.     double                     :         DoubleWritable

4.     long                       :         LongWritable

5.     String                     :         Text

6.     boolean                    :         BooleanWritable
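The MapReduce types in the table wrap a plain Java value so that it can be written to and read back from a byte stream when data moves between nodes. The sketch below shows that idea using only the Java standard library. IntBoxSketch is a hypothetical stand-in, not a Hadoop class; its write and readFields methods follow the shape of Hadoop's Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable-style wrapper around an int.
class IntBoxSketch {
    private int value;

    IntBoxSketch() { }

    IntBoxSketch(int value) { this.value = value; }

    int get() { return value; }

    // Serialize the wrapped value to a binary stream.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrite the wrapped value with one read from a binary stream.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Write a value to bytes and read it back into a fresh object.
    static int roundTrip(int n) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new IntBoxSketch(n).write(new DataOutputStream(bytes));
            IntBoxSketch copy = new IntBoxSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42)); // prints 42
    }
}
```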

 

