Sunday, 23 August 2015

Distribute by and Sort by clause in hive

DISTRIBUTE BY controls how map output is divided among reducers. By default, Map Reduce computes a hash on the keys output by mappers and tries to evenly distribute the key-value pairs among the available reducers using the hash values. Say we want the data for each value in a column to be captured together. We can use DISTRIBUTE BY to ensure that the records for each go to the same reducer. DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers receive rows for processing, Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause if it's in same query .

For example, consider the following query without using sort by


Select t3.id, t3.name, t3.salary, t3.off_location from t3 distribute by t3.off_location;









Now, consider the query with sort by.

Select t3.id, t3.name, t3.salary, t3.off_location from t3 distribute by t3.off_location sort by t3.salary desc;


Friday, 14 August 2015

Joins In Hive


Creating a table:-

1.      create table t1 (id int, name string, age int)row format delimited fields terminated by '\,';
2.      create table t2 (id int, name string, age int)row format delimited fields terminated by '\,';
3.      create table t3(id int, name string, age int, designation string, off_location string, salary double)row format delimited fields terminated by '\,';

Loading data into tables t1, t2 and t3:-

1.      load data local inpath '/home/training/Desktop/t1.txt' into table t1;
2.      load data local inpath '/home/training/Desktop/t2.txt' into table t2;
3.      load data local inpath ‘/home/training/Desktop/t3.txt’ into table t3;

Retrieving all the rows from the tables:-

select * from t1;













select * from t2;











select * from t3;









Simple join:-

Hive supports only equality joins because it is primarily difficult to express non-equality joins in map-reduce.
Syntax: select table_name_1.ColumnValue, table_name_2.ColumnValue from table_name_1 JOIN table_name_2 on (table_name_1.ColumnValue=table_name_2.ColumnValue)

select t1.id,t1.name,t1.age,t2.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);






select t1.id,t1.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);







Select t3.id,t3.designation,t2.name,t1.age from t3 join t2 on(t3.id=t2.id) join t1 on(t3.id=t1.id);






Inner join:

Inner joins are commutative; however, should the user prefer having t1 as the left table and t2 as the right table or vice-versa? The answer to this question lies in understanding how Hive compiles a join query to MapReduce. In the MapReduce job for regular inner joins, mappers run on both tables, emitting out records from that need to be joined by evaluating any UDFs in the query and filtering out any records based on the where clause. Then the shuffle phase is run which “shuffles” the keys based on the join key (id in above example). Subsequently, in the reduce phase, essentially a cross-product takes place between records from each table that have the same join key. During this reduce phase, records from all tables but the last table are buffered in memory while the records from query’s last table are streamed into the reducer. Consequently, in order to reduce the memory required by reducers, it’s recommended to put the largest tables last to ensure their records are streamed into the reducers.

Select t2.name, t3.designation, t3.off_location from t2 JOIN t3 where t2.id=t3.id AND t2.name=t3.name;







Outer Join:
The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.
Left outer join:-
Select t3.off_location, t3.id, t1.id from t3 left outer join t1 on t3.id=t1.id;









Right outer join:-
Select t3.off_location, t3.id, t1.id from t3 right outer join t1 on t3.id=t1.id;







Full Outer Join
Select t3.off_location, t3.id, t1.id from t3 full outer join t1 on t3.id=t1.id;













Lateral view explode() in hive


Lateral view is used in conjunction with user-defined table generating functions such as explode (). A UDTF generates zero or more output rows for each input row. A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

Input file:-

Column1          column2
One                 [1, 2, 3]
Two                 [3, 4, 5]

Create table lateral1 (column1 string, column2 array<int>) row format delimited collection items terminated by ‘,’ stored as textfile;

Now exploding the list of elements in an array using lateral view explode () functions:-

SELECT column1, columns
FROM lateral1 LATERAL VIEW explode (column2) lateralTable AS column;

The output result will be:-

One     1
One     2
One     3
Two     3
Two     4

Two     5

Wednesday, 12 August 2015

Views and Indexes in hive

Views:-

Views are a logical construct that can be used to simplify queries by either abstracting away complexities such as joins or sub-queries or by pre-filtered or restricting data.

Input file:-
1, San Jose, Men’s Clothing, 214.05, Amex
2, Fort Worth, Women’s Clothing, 153.57, Visa
3, San Diego, Music, 66.08, Cash
4, Pittsburgh, Pet Supplies, 493.51, Discover
5, New York, Consumer Electronics, 296.8, Cash
6, Corpus Christi, Toys, 25.38, Discover
7, Fort Worth, Toys, 213.88, Visa
8, Greensboro, DVDs, 290.82, MasterCard
9, Riverside, Consumer Electronics, 252.88, Cash
10,Reno,Crafts,88.25,Visa

The following query creates a managed table for which the view is to be applied -

Create table view_sample (id int, city string, particular string, amount float, cardtype string)
Row format delimited fields terminated by ','
Stored as textfile;

The following query loads data into table view_sample-

Load data local inpath 'views' overwrite into table view_sample;

The below query creates a view of all data for the amount greater than 100 -

Create view view1 as select * from view_sample where amount>100;

Now retrieving the rows from the created view-

Select * from view1;

Output of the view:-
1          San Jose           Men's Clothing            214.05 Amex
2          Fort Worth      Women's Clothing      153.57 Visa
4          Pittsburgh        Pet Supplies     493.51 Discover
5          New York         Consumer Electronics 296.8   Cash
7          Fort Worth      Toys     213.88 Visa
8          Greensboro     DVDs   290.82 MasterCard
9          Riverside         Consumer Electronics 252.88 Cash
Time taken: 19.958 seconds

Dropping a view:-

Drop view view1;

Index:-

Indexes are a critical for efficiently retrieving rows from tables in less time. The following query creates an index for a table column amount-

Create index index_amount on table view_sample (amount) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

WITH DEFERRED REBUILD portion of the command prevents the index from immediately being built.

Now retrieving the rows of a particular column for which the index is been built -

Select amount from view_sample;

Output of the indexed column:-
214.05
153.57
66.08
493.51
296.8
25.38
213.88
290.82
252.88
88.25
Time taken: 16.026 seconds

Dropping an index:-

Drop index index_amount on view_sample;

Alter table Statements in hive


Hive allows to add additional table properties or modify existing properties of a table using alter statements, but not to remove them.

Renaming a table column:-

The following alter query renames the column name and its datatype-

Alter table clus3 change column id1 id_1 int after name;

Modifying existing table storage properties:-

The following query will modify the properties of array:-

Alter table clus2 set SERDEPROPERTIES ('colection.delim'= '=');

The following query will modify the properties of row fields:-

Alter table clus2 set SERDEPROPERTIES ('field.delim'= ':');

The following query will modify the properties of Map datatype:-


Alter table clus2 set SERDEPROPERTIES ('mapkey.delim'= '|');

Bucketing in Hive


The main difference between Hive partitioning and bucketing is, when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory level and bucketing is a file level.

Prior to bucketing, it is important to set the following property:-

Set hive.enforce.bucketing = true;

Bucketing table can be created by using clustered by keyword. The following query creates a bucketed table in hive by clustering id1 as 5 buckets -

create table clus3 (id1 int, name string) 
partitioned by (dept string)
clustered by (id1) into 5 buckets
row format delimited
fields terminated by '\t';

The following query creates a managed partitioned table that holds temporary values prior to move to the bucketed tables.

Create table Partition2 (id1 int, name string, dept string) row format delimited fields terminated by '\t';

Now the data is loaded into the temporary table.

Load data local inpath '/home/training/Desktop/part1.txt' into table Partition2;

Now the data from temporary table is loaded into the bucketed table.


Insert overwrite table clus3 partition (dept) select id1, name, dept from partition2;

Partitioning in Hive


Partitioning of data is often used for distributing load horizontally, moving data physically closer to its most frequent users, and other purposes. Hive has the notion of partitioned tables for producing performance benefits, and they can help organize data in a logical fashion, such as hierarchically.

Query- Create table Partition2 (id1 int, name string) 
Partitioned by (dept string)
Row format delimited
Fields terminated by '\t';

You can view the partitions of a table by using SHOW PARTITIONS command:

Query- Show partitions Partition2;

You can also view the number of partitions created in HDFS by typing 

“localhost.localdomain:50075” in a web browser.

Loading data into partitioned table:-

Loading the data into the partitioned table can be of static or dynamic depending upon the data you load into the table.

Static Partition:-

Static partition requires individual loads of data to be loaded into the table for each partitions.
Query -load data local inpath '/home/training/Desktop/part1.txt' into table Partition1 partition (dept="cse");

 Dynamic Partition:-

Loading data into dynamic partitions allows single insert to the partitioned table. The following properties should be enabled while performing the dynamic partitions

Query - set hive.exec.dynamic.partiton=true;
set hive.exec.dynamic.partiton.mode=nonstrict;
set hive.exec.max.dynamic.partiton.pernode=1000;
load data local inpath '/home/training/Desktop/part1.txt' into table Partition1 partition (dept);

To view the data from a particular partition:-

Query -Show partitions clus2 partition (dept="cse");

It is also possible to add a new partition to the existing tables.

Query - Alter table clus2 add if not exists partition (year=2015);

To drop a partition from a table:-

Query - Alter table clus2 drop partition (dept=”cse”);

Loading data into Hive tables

Since Hive has no row-level insert, update, and delete operations, the only way to put data into a hive table is to use “bulk” load operations. The following query can be used to load data from local file system to the hive table stored in hdfs.

Query- Load data local inpath ‘/home/training/src.txt’ into table tab11;

When it’s necessary to load the input file from an hdfs location

Query- Load data inpath ‘src.txt’ into table tab11;

Overwrite command can be used when the hive table already holds data and to replace the existing values


Query- Load data local inpath ‘/home/training/src.txt’ overwrite into table tab11;

Storing an output:-

The following query stores the output  in local directory by creating a folder “output” in the specified path.

Query- Insert overwrite local directory ‘/home/training/output’ select * from tab11;

The following query stores the output of query in hdfs.

Query- Insert overwrite directory ‘/output’ select * from tab11.

Monday, 10 August 2015

Creating a table in hive

Creating a Table:-

Managed table:-

The managed tables are sometimes called as internal tables, because Hive controls the lifecycle of their data (more or less). Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default. When we drop a managed table, Hive deletes the entire data in the table. However, managed tables are less convenient for sharing with other tools.

Query- create table if not exists tab11 (id int, name string, sal bigint, sub array<string>, dud map<string, int>, addr struct<city:string,state:string,pin:bigint>) row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';

External table:-

The EXTERNAL keyword tells Hive the table is external and the LOCATION clause is required to tell Hive where the table is located. Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.


Query- create external table if not exists tab7 (id int, name string, sal bigint, sub array<string>, city string) row format delimited   fields terminated by ',' collection items terminated by '$' location ‘/home/training/Desktop’;

Friday, 7 August 2015

Databases in hive

The Hive concept of a database is essentially just a catalog or namespace of tables. However, they are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions. It’s also common to use databases to organize production tables into logical groups.
If you don’t specify a database, the default database is used.

Listing the already existing databases in hive:-


Query- Show databases;








Creating a database:-

Query- Create database if not exists database1;






If not exists is an optional keyword used to check whether the database already exists or not.

To override the use of default database:-

Query- Use hive1;

To view the properties of a database:-

To view the location and other properties of a database, describe command can be used.

Query- Describe database database1;






Drop Database:-

Query- Drop database database1;

Hive will create a directory for each database. Tables in that database will be stored in
Subdirectories of the database directory.