Wednesday, 12 August 2015

Bucketing in Hive


The main difference between Hive partitioning and bucketing is, when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory level and bucketing is a file level.

Prior to bucketing, it is important to set the following property:-

Set hive.enforce.bucketing = true;

Bucketing table can be created by using clustered by keyword. The following query creates a bucketed table in hive by clustering id1 as 5 buckets -

create table clus3 (id1 int, name string) 
partitioned by (dept string)
clustered by (id1) into 5 buckets
row format delimited
fields terminated by '\t';

The following query creates a managed partitioned table that holds temporary values prior to move to the bucketed tables.

Create table Partition2 (id1 int, name string, dept string) row format delimited fields terminated by '\t';

Now the data is loaded into the temporary table.

Load data local inpath '/home/training/Desktop/part1.txt' into table Partition2;

Now the data from temporary table is loaded into the bucketed table.


Insert overwrite table clus3 partition (dept) select id1, name, dept from partition2;

1 comment:

  1. hive bucketing
    Hive bucketing is a method for dividing the data into number of equal parts.
    CLUSTERED BY command is used While creating bucketing in hive.
    Hive bucketing concept can be performed on internal tables or External tables.
    Hive bucketing can perform only on one column to get best result.
    In hive command line, below command is used to enable buckets.

    ReplyDelete