MapReduce
Introduction To MapReduce:
· MapReduce is a computing model that decomposes large data-processing
jobs into individual tasks.
· These tasks can be executed in parallel
across the cluster.
· The results of the tasks are joined together to form the final
result.
· MapReduce is the data processing component
of Hadoop.
· MapReduce transforms a list of input data elements into a list of
output data elements.
· MapReduce is the heart of Hadoop. It
is designed for processing huge
amounts of data.
· There are two different processing layers:
1. Map
2. Reduce
Different Phases in MapReduce:
Map:
· Map takes a set of data and
converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs).
· The input data can be in a structured or
unstructured format.
· The key is a reference to the input
value (e.g., IntWritable, LongWritable).
· The value is the dataset on which to
operate (e.g., IntWritable, LongWritable, Text).
· The output of map is
known as the intermediate output; it is stored on the local disk, not on HDFS.
· The intermediate output of map is given as input to the reduce
phase.
· If there is no reduce phase, or once reduce processing is
complete, the output is stored on HDFS.
· The movement of output from the mapper phase to the reducer phase
is known as shuffling.
· The output pair of the mapper can differ
in type from its input pair.
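As a sketch of the map phase (plain Java rather than the Hadoop API, using the classic word-count example), the map step breaks each input line into (word, 1) tuples:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class WordCountMap {
    // Map step: emit one (word, 1) pair for every word in the input line.
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("to be or not to be"));
        // [to=1, be=1, or=1, not=1, to=1, be=1]
    }
}
```

Note that the mapper emits one pair per occurrence; counting the repeats is left to the reducer.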
· Different phases under map phase:
Ø Partitioner: The partitioner divides the mapper
output into partitions, one per reducer, typically by hashing the key.
Ø Combiner: Before the output is passed to the reduce
phase, the combiner
locally summarizes records that share the same key. The combiner is therefore known
as a “Mini-Reducer”.
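These two map-side sub-phases can be sketched in plain Java (hypothetical helper names, not the Hadoop API). The partition function mirrors the behavior of Hadoop's default HashPartitioner; the combiner pre-sums counts for repeated keys on the mapper side:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class MapSidePhases {
    // Partitioner: route a key to one of numReducers partitions by hashing.
    // (The mask keeps the hash non-negative, as HashPartitioner does.)
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Combiner ("mini-reducer"): locally sum the values that share a key,
    // shrinking the data that must be shuffled across the network.
    static Map<String, Integer> combine(List<Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Entry<String, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }
}
```

A combiner is only safe when the reduce operation is commutative and associative (like summing); for an average, for example, combining partial averages would give wrong results.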
Reduce:
· The input to the reducer is the intermediate
output produced by the mapper.
· The key/value pairs provided to the reducer
are sorted by key.
· Reduce is the second phase of MapReduce.
· The output of the reduce phase is the final output.
· Aggregate operations, such as filtering, summing, and counting, can be performed in the reduce phase.
· By default, the number of reducers is 1.
· There are 3 phases of the reducer in
MapReduce:
Ø Shuffling: The process of transferring output from mappers to reducers
is known as shuffling.
Ø Sorting: The keys generated by the mapper are automatically sorted by MapReduce.
Values arrive at the reducer grouped under sorted keys, which helps the reducer easily
distinguish when a new group of values (a new key) starts.
Ø Reduce phase: The final output is
produced after sorting and the aggregate operations are performed.
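Continuing the word-count sketch in plain Java: after shuffling and sorting, the reducer receives each key together with all of its values and aggregates them (here, by summing):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountReduce {
    // Reduce step: for one key, aggregate all of its grouped values (sum).
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Simulated shuffle/sort output: keys sorted, values grouped per key.
        Map<String, List<Integer>> grouped = new TreeMap<>(Map.of(
            "be", List.of(1, 1), "not", List.of(1),
            "or", List.of(1), "to", List.of(1, 1)));
        grouped.forEach((k, v) -> System.out.println(k + "\t" + reduce(k, v)));
    }
}
```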
· We can set the number of reducers by
using the following method:
job.setNumReduceTasks(int)
By increasing the number of reducers:
Ø Framework overhead increases
Ø Load balancing improves
Ø The cost of failures is lowered
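In the real Hadoop API, the reducer count is set on the Job object in the driver. A minimal driver sketch for the word-count job (the WordCountMapper and WordCountReducer class names are assumed to be defined elsewhere):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // assumed mapper class
        job.setCombinerClass(WordCountReducer.class);  // combiner = "mini-reducer"
        job.setReducerClass(WordCountReducer.class);   // assumed reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // override the default of 1 reducer
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```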
DataTypes:
Normal DataType : MapReduce DataType
1. int : IntWritable
2. float : FloatWritable
3. double : DoubleWritable
4. long : LongWritable
5. String : Text
6. boolean : BooleanWritable
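Hadoop uses these Writable types instead of plain Java types because they serialize to a compact binary form for shuffling across the network. The idea can be illustrated in plain Java with DataOutputStream/DataInputStream, roughly what the write() and readFields() methods of Text and IntWritable do internally:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableIdea {
    // Serialize a (String, int) pair to compact binary form,
    // as a (Text, IntWritable) pair would be serialized for the shuffle.
    static byte[] serialize(String key, int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(key);    // length-prefixed UTF-8 string
            out.writeInt(value);  // 4-byte big-endian int
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize the pair back from the binary form.
    static Object[] deserialize(byte[] data) {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(data))) {
            return new Object[] { in.readUTF(), in.readInt() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```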