Posts

Showing posts from October, 2018

Pig Split command

============SPLIT==================

Emp1 = LOAD '/user/cloudera/N_EMP1.txt' USING PigStorage(',') as (sno:int, name:chararray, role:chararray, salary:chararray, company:chararray, exp:int);
Emp2 = LOAD '/user/cloudera/N_EMP2.txt' USING PigStorage(',') as (sno:int, name:chararray, role:chararray, salary:int, company:chararray, exp:int);

The SPLIT operator is used to split a relation into two or more relations.

SPLIT Emp1 into X if (exp > 3), Y if (exp < 3);
dump X;

==========CROSS========

c = CROSS Emp1, Emp2;
dump c;

Join:

TXN = LOAD '/user/cloudera/txns.txt' USING PigStorage(',') as (txnno:INT, txndate:CHARARRAY, custno:INT, amount:INT, category:CHARARRAY, product:CHARARRAY, city:CHARARRAY, state:CHARARRAY, spendby:CHARARRAY);
CUTS = LOAD '/user/cloudera/custs' USING PigStorage(',') as (cusid:int, firstname:CHARARRAY, lastname:CHARARRAY, age:int, profession:
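The SPLIT condition in this post (exp > 3 for X, exp < 3 for Y) leaves out rows where exp is exactly 3. The sketch below mimics that routing in plain Python so the effect is easy to see. It is not Pig code: the function name split_relation and the sample rows are hypothetical and only illustrate how each row is tested against every condition, so a row can land in one output, several, or none.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def split_relation(
    rows: List[Row], conditions: Dict[str, Callable[[Row], bool]]
) -> Dict[str, List[Row]]:
    """Route each row to every named output whose condition it satisfies."""
    outputs: Dict[str, List[Row]] = {name: [] for name in conditions}
    for row in rows:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs


# Hypothetical sample rows standing in for the Emp1 relation.
emp1 = [
    {"sno": 1, "name": "asha", "exp": 5},
    {"sno": 2, "name": "ravi", "exp": 3},
    {"sno": 3, "name": "meena", "exp": 1},
]

# Same conditions as: SPLIT Emp1 into X if (exp > 3), Y if (exp < 3);
result = split_relation(
    emp1,
    {"X": lambda r: r["exp"] > 3, "Y": lambda r: r["exp"] < 3},
)

print([r["name"] for r in result["X"]])
print([r["name"] for r in result["Y"]])
```

In this sample the row with exp equal to 3 appears in neither X nor Y. If such rows should be kept, one of the conditions would need to use >= or <=.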

MapReduce

Optimization Techniques for MapReduce and Hive:
• Indexing
• Partitioning
• Bucketing
• Denormalization
• Vectorization: SET hive.vectorized.execution.enabled=true;
• Input format selection (type of file to be used)
• Unit testing
• Sampling

MAPREDUCE COUNTERS:
• Counters in MR are useful for gathering statistics about MR jobs, for example for quality control and application-level statistics.
• Each counter is defined by the MR framework.
• Counters are useful for diagnosing problems.
• Each counter in MR is named by an enum.
• Hadoop counters validate the following:
  - Whether the job read and wrote the correct number of bytes.
  - Whether each MR job ran the correct number of tasks.
  - Whether the amount of CPU or memory consumed is appropriate for our jobs and cluster nodes.

Different types of counters in Hadoop:
  - Built-in or pre-defined counters
  - User-defined or custom counters

Built-in counters:
• Apache Hadoop contains some built-in counte
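To make the idea of enum-named counters concrete, here is a minimal plain-Python sketch of the bookkeeping. It does not use the Hadoop API (in Hadoop, custom counters are normally declared as a Java enum and incremented from the task context). The names RecordQuality and count_records, the six-field record layout, and the sample lines are all hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class RecordQuality(Enum):
    """Hypothetical counter names, one enum member per counter."""
    VALID = "valid"
    MALFORMED = "malformed"


def count_records(lines: Iterable[str], expected_fields: int = 6) -> Counter:
    """Tally records by quality, the way a mapper might bump counters."""
    counters: Counter = Counter()
    for line in lines:
        fields = line.split(",")
        if len(fields) == expected_fields:
            counters[RecordQuality.VALID] += 1
        else:
            counters[RecordQuality.MALFORMED] += 1
    return counters


# Hypothetical comma-separated input; the second record is missing fields.
sample = [
    "1,asha,dev,50000,acme,5",
    "2,ravi,qa,40000",
    "3,meena,dev,45000,acme,1",
]

totals = count_records(sample)
for name in RecordQuality:
    print(name.name, totals[name])
```

A tally like this is the kind of quality-control statistic the post mentions: comparing the malformed count against the total shows at a glance whether the input was parsed as expected.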