Let us dive deep into big data hadoop!: Joins In Hive

Creating a table:-

1. create table t1 (id int, name string, age int)row format delimited fields terminated by '\,';

2. create table t2 (id int, name string, age int)row format delimited fields terminated by '\,';

3. create table t3(id int, name string, age int, designation string, off_location string, salary double)row format delimited fields terminated by '\,';

Loading data into tables t1, t2 and t3:-

1. load data local inpath '/home/training/Desktop/t1.txt' into table t1;

2. load data local inpath '/home/training/Desktop/t2.txt' into table t2;

3. load data local inpath ‘/home/training/Desktop/t3.txt’ into table t3;

Retrieving all the rows from the tables:-

select * from t1;

select * from t2;

select * from t3;

Simple join:-

Hive supports only equality joins because it is primarily difficult to express non-equality joins in map-reduce.

Syntax: select table_name_1.ColumnValue, table_name_2.ColumnValue from table_name_1 JOIN table_name_2 on (table_name_1.ColumnValue=table_name_2.ColumnValue)

select t1.id,t1.name,t1.age,t2.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);

select t1.id,t1.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);

Select t3.id,t3.designation,t2.name,t1.age from t3 join t2 on(t3.id=t2.id) join t1 on(t3.id=t1.id);

Inner join:

Inner joins are commutative; however, should the user prefer having t1 as the left table and t2 as the right table or vice-versa? The answer to this question lies in understanding how Hive compiles a join query to MapReduce. In the MapReduce job for regular inner joins, mappers run on both tables, emitting out records from that need to be joined by evaluating any UDFs in the query and filtering out any records based on the where clause. Then the shuffle phase is run which “shuffles” the keys based on the join key (id in above example). Subsequently, in the reduce phase, essentially a cross-product takes place between records from each table that have the same join key. During this reduce phase, records from all tables but the last table are buffered in memory while the records from query’s last table are streamed into the reducer. Consequently, in order to reduce the memory required by reducers, it’s recommended to put the largest tables last to ensure their records are streamed into the reducers.

Select t2.name, t3.designation, t3.off_location from t2 JOIN t3 where t2.id=t3.id AND t2.name=t3.name;

Outer Join:

The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.

Left outer join:-

Select t3.off_location, t3.id, t1.id from t3 left outer join t1 on t3.id=t1.id;

Right outer join:-

Select t3.off_location, t3.id, t1.id from t3 right outer join t1 on t3.id=t1.id;

Full Outer Join

Select t3.off_location, t3.id, t1.id from t3 full outer join t1 on t3.id=t1.id;

Let us dive deep into big data hadoop!

Friday, 14 August 2015

Joins In Hive

No comments:

Post a Comment

Hive

Blog Archive