The main
difference between Hive partitioning and bucketing is, when we do partitioning,
we create a partition for each unique value of the column. But there may be
situation where we need to create lot of tiny partitions. But if you use
bucketing, you can limit it to a number which you choose and decompose your
data into those buckets. In hive a partition is a directory level and bucketing
is a file level.
Prior to
bucketing, it is important to set the following property:-
Set
hive.enforce.bucketing = true;
Bucketing
table can be created by using clustered
by keyword. The following query creates a bucketed table in hive by
clustering id1 as 5 buckets -
create table clus3 (id1
int, name string)
partitioned by (dept
string)
clustered by (id1) into
5 buckets
row format delimited
fields terminated by
'\t';
The
following query creates a managed partitioned table that holds temporary values
prior to move to the bucketed tables.
Create table Partition2
(id1 int, name string, dept string) row format delimited fields terminated by
'\t';
Now the data
is loaded into the temporary table.
Load data local inpath
'/home/training/Desktop/part1.txt' into table Partition2;
Now the data
from temporary table is loaded into the bucketed table.
Insert overwrite table
clus3 partition (dept) select id1, name, dept from partition2;
hive bucketing
ReplyDeleteHive bucketing is a method for dividing the data into number of equal parts.
CLUSTERED BY command is used While creating bucketing in hive.
Hive bucketing concept can be performed on internal tables or External tables.
Hive bucketing can perform only on one column to get best result.
In hive command line, below command is used to enable buckets.