MapReduce

Optimization Techniques for MapReduce and Hive:
Indexing
Partitioning
Bucketing
Denormalization
Vectorization (SET hive.vectorized.execution.enabled=true;)
Input format selection (type of file to be used)
Unit Testing
Sampling.

MAPREDUCE COUNTERS:
Counters in MapReduce are useful for gathering statistics about MR jobs, for example for quality control or application-level tracking.
Each built-in counter is defined by the MR framework.
Counters are also useful for problem diagnosis.
Each counter in MR is identified by a Java enum.
Hadoop counters help validate the following:
Whether the correct number of bytes was read and written.
Whether each MR job ran the correct number of tasks.
Whether the amount of CPU and memory consumed is appropriate for our jobs and cluster nodes.
Different types of counters in Hadoop:
Built-in or pre-defined counters
User-defined counters or custom counters
Built-in counters:
Apache Hadoop contains some built-in counters which report various metrics.
These counters allow us to confirm that the expected amount of input is consumed and the expected amount of output is produced.
Hadoop counters are divided into groups; each group contains either task counters or job counters.
The different types of built-in counters are:
MapReduce Task Counters
File System Counters
FileInputFormat Counters
FileOutputFormat Counters
Job Counters
MapReduce Task Counter:
Task counters collect information about tasks during execution, such as the number of records read and written.
For example, the MAP_INPUT_RECORDS counter counts the number of input records read by each map task.
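Once the job has finished, the driver can read such a counter from the Job object. A minimal driver-side sketch (the class name and helper method are illustrative; it assumes the job has already completed):

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
    // Call after job.waitForCompletion(true) returns in the driver.
    static void printMapInputRecords(Job job) throws Exception {
        Counter records = job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS);
        System.out.println("Map input records: " + records.getValue());
    }
}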




File System Counters:
These counters gather information such as the number of bytes read and written by the file system.
The counter names and descriptions are as follows:
  FileSystem bytes read: Number of bytes read by the file system.
  FileSystem bytes written: Number of bytes written to the file system.
FileInputFormat Counters:
These counters gather information about the number of bytes read by map tasks via FileInputFormat.
FileOutputFormat Counters:
These counters gather information about the number of bytes written by map tasks (for map-only jobs) or reduce tasks via FileOutputFormat.

Job Counters in MapReduce:
Job counters measure job-level statistics.
They do not measure values that change while a task is running.
For example, TOTAL_LAUNCHED_MAPS counts the number of map tasks that were launched over the course of a job.
User-Defined Counters or Custom Counters:
MR allows user code to define a set of counters and increment them in the mapper or reducer.
A job may define an arbitrary number of enums, each with an arbitrary number of fields. The enum name becomes the counter group name, and the enum fields become the counter names.
Because Java enums are defined at compile time, we cannot create new counters at runtime using enums alone.
For counter names that are only known at runtime, we use dynamic counters, which are identified by group and counter name strings instead of being defined at compile time.
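A minimal sketch of a mapper that uses both an enum counter and a dynamic counter (the class name, the Quality enum, and the assumed 5-field CSV layout are illustrative, not from the original notes):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // The enum name ("Quality") becomes the counter group name; its fields become the counter names.
    enum Quality { GOOD_RECORDS, MALFORMED_RECORDS }

    private static final int EXPECTED_FIELDS = 5;   // assumed CSV layout, illustrative

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == EXPECTED_FIELDS) {
            context.getCounter(Quality.GOOD_RECORDS).increment(1);
            context.write(new Text(fields[0]), new LongWritable(1));
        } else {
            context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
            // Dynamic counter: group and counter names are plain strings chosen at runtime.
            context.getCounter("MalformedRecords", "FIELD_COUNT_" + fields.length).increment(1);
        }
    }
}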

JOB SCHEDULERS IN HADOOP:
Schedulers play an important role in deciding which job's tasks run, and when.
FIFO, Capacity, and Fair are the schedulers that ship with Hadoop; many improved schedulers came into existence later.

FIFO Schedulers:
The original scheduling algorithm that was integrated within the JobTracker was called FIFO.
In FIFO scheduling, a JobTracker pulled jobs from a work queue, oldest job first.
This scheduler had no concept of the priority or size of a job, but the approach was simple to implement and efficient.
Problem: short jobs get stuck behind long ones.

Fair Schedulers:
The core idea behind the fair scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources.
The result is that jobs that require less time are able to access the CPU and finish sooner, rather than waiting behind jobs that require more time to execute.
The fair scheduler was developed by Facebook.
The Fair Scheduler lets all apps run by default, but it is also possible to limit the number of running apps per user and per queue through the scheduler's configuration file.
Limiting the apps does not cause any subsequently submitted apps to fail; they only wait in the scheduler's queue until some of the user's earlier apps finish.
Capacity Scheduler:
In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots.
Each queue is also assigned a guaranteed capacity (the overall capacity of the cluster is the sum of each queue's capacity).
Queues are monitored; if a queue is not consuming its allocated capacity, the excess capacity can be temporarily allocated to other queues.
We configure the capacity scheduler within multiple Hadoop configuration files.
The queues are defined within hadoop-site.xml, and the queue configurations are set in capacity-scheduler.xml.
DISTRIBUTED CACHE IN HADOOP:
The distributed cache is a facility provided by the MR framework.
It caches files whenever they are needed by applications running on the cluster.
It can cache read-only text files, jars, archive files, etc.
Once we have cached a file for our job, Hadoop makes it available on each datanode where map or reduce tasks are running.

Working and Implementation of Distributed cache:
Make sure that the file is available.
Make sure that the file can be accessed via a URL (either hdfs:// or http://).
Copy the prerequisite file to HDFS:
hadoop fs -put filename /user/cloudera/filename
Set up the application's job configuration:
DistributedCache.addFileToClassPath(new Path("location"), conf)
Add this in the driver class (see the sketch below).
The default size of the distributed cache is 10 GB.
The size is configurable and can be changed in mapred-site.xml.
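A minimal sketch of the driver-side and task-side halves of this workflow, using the DistributedCache API referenced above (the lookup.txt path and the LookupMapper class are illustrative; newer Hadoop releases expose the same functionality through Job.addCacheFile and context.getCacheFiles):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // Driver side: register the HDFS file before submitting the job.
    public static void cacheLookupFile(Configuration conf) {
        DistributedCache.addCacheFile(URI.create("/user/cloudera/lookup.txt"), conf);
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Task side: the cached file has been copied to the local disk of every
        // node that runs a map or reduce task for this job.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localFiles != null && localFiles.length > 0) {
            try (BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // ... parse the line and load it into an in-memory lookup structure ...
                }
            }
        }
    }
}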

Advantages:
No single point of failure: the failure of a single node does not affect the whole cluster.
Stores complex data.
Data consistency


Record compressed and uncompressed (SequenceFile):
Each call to append() adds a record to the sequence file; each record stores the length of the whole record, the key length, the key, and the value.
Block compressed:
In this format, records are not written out until a threshold size is reached; when the threshold is reached, the buffered keys and values are compressed together in blocks.
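A minimal sketch of writing a block-compressed SequenceFile with append() (the output path and the key/value types are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/cloudera/data.seq");   // illustrative output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            for (int i = 0; i < 1000; i++) {
                // Each append() adds one record; with BLOCK compression the records are
                // buffered and compressed together once the block threshold is reached.
                writer.append(new IntWritable(i), new Text("value-" + i));
            }
        } finally {
            writer.close();
        }
    }
}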
Compression and compression codecs in MapReduce:
File compression brings two major benefits:
It reduces the space needed to store files
It speeds up data transfer across the network or to or from disk.

Compressing input files:
If the input file is compressed, the number of bytes read from HDFS is reduced, which means less time spent reading data.
If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use.


Compressing Output files:
It is often worthwhile to compress output files before storing them in HDFS.
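A minimal driver-side sketch of enabling output compression with the built-in Gzip codec (the class and method names here are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputCompressionConfig {
    // Call while building the Job in the driver, before submission.
    static void enableGzipOutput(Job job, Path outputDir) {
        FileOutputFormat.setOutputPath(job, outputDir);
        FileOutputFormat.setCompressOutput(job, true);                   // compress the final output files
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // codec used for the output
    }
}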
Different Compressing techniques:
Gzip
Bzip2
LZO
Snappy

Gzip:
Gzip is natively supported by Hadoop; it is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.
Provides high compression ratio.
Uses high CPU resources for  compressing and decompressing data.
Good choice for cold data which is infrequently accessed.
Compressed data is not splittable and hence not well suited for MR jobs.
Bzip2:
Bzip2 generates a better compression ratio than does Gzip, but it’s much slower.
It typically compresses files to within 10% to 15% of the best available techniques.
Takes a long time to compress and decompress data.
Compressed data is splittable, but it is less suitable for MR jobs because compression and decompression take longer.

LZO Compression technique:
The LZO compression format is composed of many smaller blocks of compressed data.
LZO supports splittable compression, which enables the parallel processing of compressed text file splits by your MapReduce jobs.
Provides low compression ratio.
Very fast in compressing and decompressing.
Compressed data is splittable if a suitable index is built for it.
Well suited for MR jobs.
Set mapred.compress.map.output=true (with the map output codec set to LZO) to enable LZO compression of intermediate map output.

Snappy:
It is a compression/decompression library.
It aims for very high speeds and reasonable compression.
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
Snappy is widely used inside Google, in everything from BigTable and MapReduce to Google's internal RPC systems.

OPTIMIZATION TECHNIQUES:
  Proper configuration of your cluster.
  LZO compression usage.
  Proper tuning of the number of MapReduce tasks.
  Use of a combiner between mapper and reducer.
  Use of the most appropriate and compact Writable type for data.
  Reuse of Writables.

Proper configuration of your cluster:
Mount DFS and MapReduce storage with the noatime option. This disables access-time updates, which improves I/O performance.
Try to avoid RAID on tasktracker and datanode machines, as it generally reduces performance.
Ensure that mapred.local.dir and dfs.data.dir point to one directory on each of your disks, which ensures that all of your I/O capacity is used.
If you see swap being used, reduce the amount of RAM allocated to each task in mapred.child.java.opts.



LZO Compression usage:
Set mapred.compress.map.output to true to enable compression of the intermediate map output, and configure the LZO codec as the map output compression codec.
Although LZO adds a little bit of overhead to CPU, it saves time by reducing the amount of disk IO during the shuffle.
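A minimal sketch of setting these properties on the job Configuration; it assumes the third-party hadoop-lzo library is installed on the cluster and provides com.hadoop.compression.lzo.LzoCodec (in newer Hadoop releases the equivalent property is mapreduce.map.output.compress):

import org.apache.hadoop.conf.Configuration;

public class MapOutputCompressionConfig {
    static void enableLzoMapOutput(Configuration conf) {
        conf.setBoolean("mapred.compress.map.output", true);                                    // compress intermediate map output
        conf.set("mapred.map.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec"); // assumes hadoop-lzo is installed
    }
}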
Proper tuning of the number of MapReduce tasks:
If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB, so that the number of tasks will be smaller.
You can change the block size by rewriting the data with the command: hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks


Piggy Bank Utility (jar in Downloads):

Converting a text file into Avro (in Pig):
Load the data:
cust = LOAD '/user/cloudera/custs' USING PigStorage(',') AS (cusId:int, name1:chararray, name2:chararray, age:int, profession:chararray);
Store it as Avro using the piggybank AvroStorage function (the piggybank jar must be REGISTERed first; the output path below is illustrative):
STORE cust INTO '/user/cloudera/custs_avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
