Thursday 1 October 2015

Error occurred executing hive query: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [hue] does not have [USE] privilege on [default]

The solution to this issue is as follows:-

Step 1:-

Change the file permission to 777 for the following files in the conf and conf.server locations:-
/etc/hive/conf/hive-site.xml
/etc/hive/conf.server/hive-site.xml
/etc/hive/conf/hiveserver2-site.xml
/etc/hive/conf.server/hiveserver2-site.xml
For example: chmod 777 /etc/hive/conf/hive-site.xml

Step 2:-

Change the property values in the hive-site.xml files in both the conf and conf.server locations.
Make sure the following property is set to false:-

   <property>
     <name>hive.security.authorization.enabled</name>
     <value>false</value>
   </property>

Change the value of the following property so that it contains only org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider (dropping the MetaStoreAuthzAPIAuthorizerEmbedOnly entry):-

   <property>
     <name>hive.security.metastore.authorization.manager</name>
     <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider,org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmbedOnly</value>
   </property>

After the change it should read:-

   <property>
     <name>hive.security.metastore.authorization.manager</name>
     <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
   </property>

Step 3:-

Change the property values in the hiveserver2-site.xml files in both the conf and conf.server locations.
Make sure the following property is set to false:-
   <property>
       <name>hive.security.authorization.enabled</name>
       <value>false</value>
   </property>

Step 4:-

Change the file permission to 777 for the hdfs-site.xml file in the conf location:-

For example: chmod 777 /etc/hadoop/conf/hdfs-site.xml

Step 5:-

Change the property values in the hdfs-site.xml file in the conf location.
Make sure the following property is set to false:-

<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>

Step 6:-

Now restart the Hive services from the Ambari web UI at localhost:8080.

Step 7:-

Now open the Hue interface at localhost:8000 and click on Beeswax (Hive UI) in the menu.

Step 8:-

Execute the below query in the query editor:-
set role admin;

Now the result of the query will be displayed. Hope this works well.
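To further confirm that the permission error is gone, the database named in the original error can be exercised directly from the same editor (an illustrative check, standard HiveQL):

use default;
show tables;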


--Good luck--

Sunday 23 August 2015

Distribute by and Sort by clauses in hive

DISTRIBUTE BY controls how map output is divided among reducers. By default, MapReduce computes a hash of the keys emitted by the mappers and uses the hash values to distribute the key-value pairs evenly among the available reducers. Say we want the data for each value in a column to be captured together: we can use DISTRIBUTE BY to ensure that the records for each value go to the same reducer. DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing. Note that Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause when both appear in the same query.

For example, consider the following query without using sort by


Select t3.id, t3.name, t3.salary, t3.off_location from t3 distribute by t3.off_location;

Now, consider the query with sort by.

Select t3.id, t3.name, t3.salary, t3.off_location from t3 distribute by t3.off_location sort by t3.salary desc;
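When the same column is used both for distributing and for (ascending) sorting, HiveQL also offers CLUSTER BY as a shorthand for the combination; shown here on the same table:

Select t3.id, t3.name, t3.salary, t3.off_location from t3 cluster by t3.off_location;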


Friday 14 August 2015

Joins In Hive


Creating a table:-

1.      create table t1 (id int, name string, age int)row format delimited fields terminated by '\,';
2.      create table t2 (id int, name string, age int)row format delimited fields terminated by '\,';
3.      create table t3(id int, name string, age int, designation string, off_location string, salary double)row format delimited fields terminated by '\,';

Loading data into tables t1, t2 and t3:-

1.      load data local inpath '/home/training/Desktop/t1.txt' into table t1;
2.      load data local inpath '/home/training/Desktop/t2.txt' into table t2;
3.      load data local inpath '/home/training/Desktop/t3.txt' into table t3;

Retrieving all the rows from the tables:-

select * from t1;

select * from t2;

select * from t3;

Simple join:-

Hive supports only equality joins, because non-equality joins are difficult to express in MapReduce.
Syntax: select table_name_1.ColumnValue, table_name_2.ColumnValue from table_name_1 JOIN table_name_2 on (table_name_1.ColumnValue=table_name_2.ColumnValue)

select t1.id,t1.name,t1.age,t2.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);

select t1.id,t1.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);

Select t3.id,t3.designation,t2.name,t1.age from t3 join t2 on(t3.id=t2.id) join t1 on(t3.id=t1.id);

Inner join:

Inner joins are commutative; however, should the user prefer having t1 as the left table and t2 as the right table, or vice versa? The answer lies in understanding how Hive compiles a join query to MapReduce. In the MapReduce job for a regular inner join, mappers run on both tables, emitting the records that need to be joined after evaluating any UDFs in the query and filtering out records based on the where clause. The shuffle phase then "shuffles" the records based on the join key (id in the above example). Subsequently, in the reduce phase, essentially a cross-product takes place between records from each table that share the same join key. During this reduce phase, records from all tables but the last are buffered in memory, while the records from the query's last table are streamed into the reducer. Consequently, to reduce the memory required by the reducers, it is recommended to put the largest table last so that its records are streamed in.
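When reordering the tables in the query is not convenient, Hive also provides the STREAMTABLE hint to mark which table should be streamed through the reducers; the hint is standard Hive syntax, sketched here against the example tables:

Select /*+ STREAMTABLE(t3) */ t2.name, t3.designation, t3.off_location from t2 JOIN t3 ON (t2.id = t3.id);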

Select t2.name, t3.designation, t3.off_location from t2 JOIN t3 where t2.id=t3.id AND t2.name=t3.name;

Outer Join:
An outer join has one or more preserved-row tables whose rows must all be returned: for left outer joins it is the left table, for right outer joins it is the right table, and for full outer joins both tables are preserved-row tables.
Left outer join:-
Select t3.off_location, t3.id, t1.id from t3 left outer join t1 on t3.id=t1.id;

Right outer join:-
Select t3.off_location, t3.id, t1.id from t3 right outer join t1 on t3.id=t1.id;

Full outer join:-
Select t3.off_location, t3.id, t1.id from t3 full outer join t1 on t3.id=t1.id;

Lateral view explode() in hive


Lateral view is used in conjunction with user-defined table-generating functions such as explode(). A UDTF generates zero or more output rows for each input row. A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.

Input file:-

Column1    Column2
One        [1, 2, 3]
Two        [3, 4, 5]

Create table lateral1 (column1 string, column2 array<int>) row format delimited collection items terminated by ',' stored as textfile;

Now exploding the list of elements in the array using the lateral view explode() function:-

SELECT column1, col
FROM lateral1 LATERAL VIEW explode(column2) lateralTable AS col;

The output result will be:-

One     1
One     2
One     3
Two     3
Two     4
Two     5
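Note that a row whose array is empty produces no output rows with a plain lateral view. Hive (0.12 and later) provides the LATERAL VIEW OUTER variant, which keeps such rows with a NULL in the exploded column:

SELECT column1, col
FROM lateral1 LATERAL VIEW OUTER explode(column2) lateralTable AS col;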

Wednesday 12 August 2015

Views and Indexes in hive

Views:-

A view is a logical construct that can be used to simplify queries, either by abstracting away complexities such as joins and sub-queries or by pre-filtering and restricting data.

Input file:-
1,San Jose,Men's Clothing,214.05,Amex
2,Fort Worth,Women's Clothing,153.57,Visa
3,San Diego,Music,66.08,Cash
4,Pittsburgh,Pet Supplies,493.51,Discover
5,New York,Consumer Electronics,296.8,Cash
6,Corpus Christi,Toys,25.38,Discover
7,Fort Worth,Toys,213.88,Visa
8,Greensboro,DVDs,290.82,MasterCard
9,Riverside,Consumer Electronics,252.88,Cash
10,Reno,Crafts,88.25,Visa

The following query creates a managed table to which the view will be applied -

Create table view_sample (id int, city string, particular string, amount float, cardtype string)
Row format delimited fields terminated by ','
Stored as textfile;

The following query loads data into table view_sample-

Load data local inpath 'views' overwrite into table view_sample;

The below query creates a view of all rows with an amount greater than 100 -

Create view view1 as select * from view_sample where amount>100;

Now retrieving the rows from the created view-

Select * from view1;

Output of the view:-
1          San Jose           Men's Clothing            214.05 Amex
2          Fort Worth      Women's Clothing      153.57 Visa
4          Pittsburgh        Pet Supplies     493.51 Discover
5          New York         Consumer Electronics 296.8   Cash
7          Fort Worth      Toys     213.88 Visa
8          Greensboro     DVDs   290.82 MasterCard
9          Riverside         Consumer Electronics 252.88 Cash
Time taken: 19.958 seconds
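Since a view can be queried like any other table, it can also be filtered further; an illustrative query against view1:

Select * from view1 where cardtype='Visa';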

Dropping a view:-

Drop view view1;

Index:-

Indexes are critical for retrieving rows from tables efficiently. The following query creates an index on the table column amount-

Create index index_amount on table view_sample (amount) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

The WITH DEFERRED REBUILD portion of the command prevents the index from being built immediately.
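When you are ready to build (or later refresh) the deferred index, Hive provides an ALTER INDEX statement for it (standard syntax, shown for the index created above):

Alter index index_amount on view_sample rebuild;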

Now retrieving the rows of the column for which the index has been built -

Select amount from view_sample;

Output of the indexed column:-
214.05
153.57
66.08
493.51
296.8
25.38
213.88
290.82
252.88
88.25
Time taken: 16.026 seconds

Dropping an index:-

Drop index index_amount on view_sample;

Alter table Statements in hive


Hive allows you to add new table properties or modify existing properties of a table using alter statements, but not to remove them.
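For instance, a table property can be added or updated with SET TBLPROPERTIES (standard Hive syntax; the property name and value below are made up for illustration):

Alter table clus2 set TBLPROPERTIES ('note'='edited during cleanup');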

Renaming a table column:-

The following alter query renames the column id1 to id_1, declares its datatype, and moves it after the name column-

Alter table clus3 change column id1 id_1 int after name;

Modifying existing table storage properties:-

The following query modifies the collection (array) item delimiter; note that 'colection.delim' is the actual property name in Hive's LazySimpleSerDe (the misspelling is in Hive itself):-

Alter table clus2 set SERDEPROPERTIES ('colection.delim'= '=');

The following query modifies the field delimiter:-

Alter table clus2 set SERDEPROPERTIES ('field.delim'= ':');

The following query modifies the map key delimiter:-

Alter table clus2 set SERDEPROPERTIES ('mapkey.delim'= '|');

Bucketing in Hive


The main difference between Hive partitioning and bucketing is that partitioning creates a partition for each unique value of the column, and there may be situations where this produces a lot of tiny partitions. With bucketing, you can instead limit the data to a number of buckets you choose and decompose the data into those buckets. In Hive, a partition is at the directory level and a bucket is at the file level.

Prior to bucketing, it is important to set the following property:-

Set hive.enforce.bucketing = true;

A bucketed table can be created using the clustered by keyword. The following query creates a bucketed table in hive by clustering id1 into 5 buckets -

create table clus3 (id1 int, name string) 
partitioned by (dept string)
clustered by (id1) into 5 buckets
row format delimited
fields terminated by '\t';

The following query creates a managed staging table that holds the values temporarily before they are moved into the bucketed table.

Create table Partition2 (id1 int, name string, dept string) row format delimited fields terminated by '\t';

Now the data is loaded into the temporary table.

Load data local inpath '/home/training/Desktop/part1.txt' into table Partition2;

Now the data from temporary table is loaded into the bucketed table.


Insert overwrite table clus3 partition (dept) select id1, name, dept from partition2;
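One payoff of bucketing is efficient sampling: a bucketed table can be sampled bucket by bucket using Hive's standard TABLESAMPLE clause, sketched here for the table created above:

Select * from clus3 TABLESAMPLE (BUCKET 1 OUT OF 5 ON id1) s;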

Partitioning in Hive


Partitioning of data is often used for distributing load horizontally, moving data physically closer to its most frequent users, and other purposes. Hive has the notion of partitioned tables for producing performance benefits, and they can help organize data in a logical fashion, such as hierarchically.

Query- Create table Partition2 (id1 int, name string) 
Partitioned by (dept string)
Row format delimited
Fields terminated by '\t';

You can view the partitions of a table by using the SHOW PARTITIONS command:

Query- Show partitions Partition2;

You can also view the partition directories created in HDFS by typing "localhost.localdomain:50075" in a web browser.

Loading data into partitioned table:-

Loading data into a partitioned table can be static or dynamic, depending on how the partition values are supplied.

Static Partition:-

Static partitioning requires a separate load of data into the table for each partition.
Query - load data local inpath '/home/training/Desktop/part1.txt' into table Partition1 partition (dept="cse");

 Dynamic Partition:-

Dynamic partitioning allows a single insert statement to populate many partitions at once. The following properties should be enabled while performing dynamic partition inserts; note that LOAD DATA cannot supply dynamic partition values, so the final step is an insert-select from a staging table such as Partition2 above:

Query - set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
insert overwrite table Partition1 partition (dept) select id1, name, dept from Partition2;

To list a particular partition of a table:-

Query - Show partitions clus2 partition (dept="cse");

It is also possible to add a new partition to the existing tables.

Query - Alter table clus2 add if not exists partition (year=2015);

To drop a partition from a table:-

Query - Alter table clus2 drop partition (dept="cse");
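Queries that filter on the partition column read only the matching partition directory, which is where the performance benefit comes from; an illustrative query against the table loaded earlier:

Query - Select * from Partition1 where dept="cse";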

Loading data into Hive tables

Since Hive has no row-level insert, update, and delete operations, the only way to put data into a Hive table is to use "bulk" load operations. The following query can be used to load data from the local file system into a Hive table stored in HDFS.

Query- Load data local inpath '/home/training/src.txt' into table tab11;

When it's necessary to load the input file from an HDFS location-

Query- Load data inpath 'src.txt' into table tab11;

The overwrite keyword can be used when the Hive table already holds data and the existing values need to be replaced-

Query- Load data local inpath '/home/training/src.txt' overwrite into table tab11;

Storing an output:-

The following query stores the output in a local directory by creating a folder "output" at the specified path.

Query- Insert overwrite local directory '/home/training/output' select * from tab11;

The following query stores the output of the query in HDFS.

Query- Insert overwrite directory '/output' select * from tab11;
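Besides LOAD DATA, the other bulk option is to copy rows from an existing table with an insert-select; this sketch assumes a second table tab12 whose schema matches tab11:

Query- Insert into table tab12 select * from tab11;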

Monday 10 August 2015

Creating a table in hive

Creating a Table:-

Managed table:-

Managed tables are sometimes called internal tables, because Hive controls the lifecycle of their data (more or less). By default, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse). When we drop a managed table, Hive deletes the table's data. However, managed tables are less convenient for sharing with other tools.

Query- create table if not exists tab11 (id int, name string, sal bigint, sub array<string>, dud map<string, int>, addr struct<city:string,state:string,pin:bigint>) row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';
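Once data is loaded, the array, map, and struct columns of this table can be addressed directly in a select; a small sketch (the array index and the map key 'm1' are made up for illustration):

Query- select name, sub[0], dud['m1'], addr.city from tab11;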

External table:-

The EXTERNAL keyword tells Hive the table is external, and the LOCATION clause tells Hive where the table's data is located. Because it is external, Hive does not assume it owns the data; therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.


Query- create external table if not exists tab7 (id int, name string, sal bigint, sub array<string>, city string) row format delimited fields terminated by ',' collection items terminated by '$' location '/home/training/Desktop';