TECH FOR U

Posts

Showing posts from October, 2018

Pig Split command

- October 24, 2018

============SPLIT==================

Emp1 = LOAD '/user/cloudera/N_EMP1.txt' USING PigStorage(',') as (sno:int, name:chararray, role:chararray, salary:chararray, company:chararray, exp:int);

Emp2 = LOAD '/user/cloudera/N_EMP2.txt' USING PigStorage(',') as (sno:int, name:chararray, role:chararray, salary:int, company:chararray, exp:int);

The SPLIT operator is used to split a relation into two or more relations.

SPLIT Emp1 into X if (exp > 3), Y if (exp < 3);
dump X;

==========CROSS========

c = CROSS Emp1, Emp2;
dump c;

Join:

TXN = LOAD '/user/cloudera/txns.txt' USING PigStorage(',') as (txnno:INT, txndate:CHARARRAY, custno:INT, amount:INT, category:CHARARRAY, product:CHARARRAY, city:CHARARRAY, state:CHARARRAY, spendby:CHARARRAY);

CUTS = LOAD '/user/cloudera/custs' USING PigStorage(',') as (cusid:int, firstname:CHARARRAY, lastname:CHARARRAY, age:int, profession:...
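One detail of the SPLIT statement in this post is easy to miss: a row is tested against every condition independently, so a row can land in more than one relation or in none. With the conditions exp > 3 and exp < 3, an employee with exp equal to 3 ends up in neither X nor Y. The sketch below models that behaviour in plain Python rather than Pig; the split helper, the three-field row shape, and the sample rows are all made up for illustration.

```python
# Plain-Python model of Pig's SPLIT semantics (this is not Pig itself).
from typing import Callable, Dict, List, Tuple

# (sno, name, exp): a hypothetical subset of the Emp1 schema
Row = Tuple[int, str, int]


def split(rows: List[Row],
          conditions: Dict[str, Callable[[Row], bool]]) -> Dict[str, List[Row]]:
    """Route each row to every relation whose condition it satisfies."""
    out: Dict[str, List[Row]] = {name: [] for name in conditions}
    for row in rows:
        for name, predicate in conditions.items():
            if predicate(row):  # conditions are checked independently
                out[name].append(row)
    return out


# Made-up sample data; the real N_EMP1.txt contents are not shown in the post.
emp1 = [(1, "asha", 5), (2, "ravi", 3), (3, "meena", 1)]

result = split(emp1, {
    "X": lambda r: r[2] > 3,  # SPLIT ... X if (exp > 3)
    "Y": lambda r: r[2] < 3,  # SPLIT ... Y if (exp < 3)
})

print("X:", result["X"])
print("Y:", result["Y"])
```

If rows with exp equal to 3 should be kept, one of the two conditions would need to become >= or <=.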

MapReduce

- October 24, 2018

Optimization Techniques Map Reduce && Hive:
• Indexing
• Partitioning
• Bucketing
• Denormalization
• Vectorization: SET hive.vectorized.execution.enabled=true
• Input format selection (type of file to be used)
• Unit testing
• Sampling

MAPREDUCE COUNTERS:
• Counters in MR are useful for gathering statistics about MR jobs, for example for quality control and application-level statistics.
• Each counter is defined by the MR framework.
• Counters are useful for diagnosing problems.
• Each counter in MR is named by an enum.
• Hadoop counters validate the following:
   - Whether the job read and wrote the correct number of bytes.
   - Whether each MR job ran the correct number of tasks.
   - Whether the amount of CPU or memory consumed is appropriate for our jobs and cluster nodes.

Different types of counters in Hadoop:
   - Built-in or pre-defined counters
   - User-defined or custom counters

Built-in counters:
• Apache Hadoop contains some built-...
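To make the idea of a user-defined counter named by an enum concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the RecordQuality enum, the map_records function, and the sample lines are hypothetical, and a collections.Counter stands in for the job's counter store. In a real Java mapper the increment would typically go through context.getCounter(...).increment(1).

```python
# Plain-Python model of enum-named, user-defined counters (not the Hadoop API).
from collections import Counter
from enum import Enum
from typing import List, Tuple


class RecordQuality(Enum):
    """Hypothetical counter group used for quality control."""
    VALID = "valid"
    MALFORMED = "malformed"


def map_records(lines: List[str], counters: Counter) -> List[Tuple[str, str, int]]:
    """Parse 'sno,name,exp' lines and count good and bad records."""
    parsed = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) == 3 and fields[2].isdigit():
            counters[RecordQuality.VALID] += 1
            parsed.append((fields[0], fields[1], int(fields[2])))
        else:
            # Bad input is counted instead of failing the whole job.
            counters[RecordQuality.MALFORMED] += 1
    return parsed


counters: Counter = Counter()
records = map_records(["1,asha,5", "2,ravi", "3,meena,x"], counters)

for name, value in counters.items():
    print(name.name, value)
```

Comparing the two totals after a run gives a quick check on input quality without reading the output files.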