Tuesday, 4 October 2016

Hive User Defined Functions (Hive UDFs)

Hive streaming (often loosely called a Hive UDF) is used when Hive's built-in functions are insufficient. For example, extracting the ordering of elements in an array would otherwise require writing a custom Java UDF, which takes a lot of time. It is far easier to write a script in another language such as Python or Perl and plug it into a Hive query using the TRANSFORM clause.

Input data:-
Userid              movieid           rating              unixtime
6                      86                    3                      883603013
7                      32                    4                      891350932
8                      338                  4                      879361873

Now create a Hive table and load the input data into it as shown below:-

hive > Create table udf_data (userid int, movieid int, rating int, unixtime string)
Row format delimited fields terminated by '\t'
Stored as textfile;

hive > Load data local inpath '/home/training/Desktop/udf1.txt' into table udf_data;
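
(Optional, not part of the original walkthrough.) A quick sanity check that the rows loaded correctly:-

hive > SELECT * FROM udf_data LIMIT 3;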

Python UDF (udf1.py):-

import sys
import datetime

# Read tab-separated rows from standard input (Hive streams one row per line).
for line in sys.stdin:
    line = line.strip().split('\t')
    userid, movieid, rating, unixtime = line
    # Convert the Unix timestamp into an ISO weekday (1 = Monday ... 7 = Sunday).
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Write the transformed row back to Hive as tab-separated fields.
    print '\t'.join([userid, movieid, rating, str(weekday)])

Adding the Python file as a resource to the Hive session (it is shipped to the cluster when the query runs):-
hive > Add file udf1.py;
Resource added.

The following query returns the weekday for each unixtime value by streaming the rows through the Python script above:-
hive > SELECT TRANSFORM (userid, movieid, rating, unixtime) USING 'python udf1.py' AS (userid, movieid, rating, weekday) FROM udf_data;
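
Note that the columns named in the AS clause of TRANSFORM default to the STRING type. If typed output is wanted, or the result should be stored in a table, something along these lines should work (the table name udf_weekday is only illustrative, not part of the original example):-

hive > CREATE TABLE udf_weekday (userid INT, movieid INT, rating INT, weekday INT);
hive > INSERT OVERWRITE TABLE udf_weekday
       SELECT TRANSFORM (userid, movieid, rating, unixtime)
       USING 'python udf1.py'
       AS (userid INT, movieid INT, rating INT, weekday INT)
       FROM udf_data;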

The output of the Hive streaming query is:-

6         86        3          3
7         32        4          2
8         338      4          3

Time taken: 20.894 seconds

Thursday, 25 August 2016

SQOOP Import to HBase

Besides HDFS and Hive, Sqoop can also import data into additional targets such as HBase. The following arguments are used in Sqoop to import data from an RDBMS into HBase:-

--hbase-table <table_name>
                This argument instructs Sqoop to import the data into the specified HBase table rather than into an HDFS directory. It can be used when the HBase table already exists, or together with the --hbase-create-table argument to create the table as part of the import.

--hbase-row-key <column>
                This argument tells Sqoop which RDBMS column should be used as the row key in HBase.

--hbase-create-table
                This argument makes Sqoop create the target table in HBase if it does not already exist.

--column-family <family>
                This argument specifies the HBase column family under which the imported columns are stored; it is required when importing into HBase, as used in the example below.

sqoop import \
--connect jdbc:mysql://localhost/cts571687?user=root \
--table incremental_mysql \
--hbase-table hbase_newimport \
--hbase-row-key movieid \
--column-family metadata \
--hbase-create-table \
-m 1

Thursday, 1 October 2015

Error occurred executing hive query: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [hue] does not have [USE] privilege on [default]

The solution to the above error is as follows:-

Step 1:-

Change the file permissions to 777 for the following files in the conf and conf.server locations:-
/etc/hive/conf/hive-site.xml
/etc/hive/conf.server/hive-site.xml
/etc/hive/conf/hiveserver2-site.xml
/etc/hive/conf.server/hiveserver2-site.xml
For example: chmod 777 /etc/hive/conf/hive-site.xml

Step 2:-

Change the following property values in the hive-site.xml files in both the conf and conf.server locations.
Make sure the following property value is set to false:-
   <property>
     <name>hive.security.authorization.enabled</name>
     <value>false</value>
   </property>

Also change the value of the hive.security.metastore.authorization.manager property to org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider. The property may initially look like this:-
<property>
<name>hive.security.metastore.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider,org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmbedOnly</value>
</property>

After the change it should look like this:-
   <property>
   <name>hive.security.metastore.authorization.manager</name>
   <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
   </property>

Step 3:-

Change the property values in the hiveserver2-site.xml files in the conf and conf.server locations.
Make sure the following property value is set to false:-
   <property>
       <name>hive.security.authorization.enabled</name>
       <value>false</value>
   </property>

Step 4:-

Change the file permissions to 777 for the hdfs-site.xml file in the conf location:-

For example: chmod 777 /etc/hadoop/conf/hdfs-site.xml

Step 5:-

Change the property values in the hdfs-site.xml file in the conf location.
Make sure the following property value is set to false:-

<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>

Step 6:-

Now restart the Hive services from the Ambari UI at localhost:8080.

Step 7:-

Now open the Hue interface at localhost:8000 and click on Beeswax (Hive UI) in the menu.

Step 8:-

Execute the below query in the query editor:-
set role admin;

Now the result of the query will be displayed. Hope this works well.


--Good luck--

Sunday, 23 August 2015

Distribute by and Sort by clauses in Hive

DISTRIBUTE BY controls how map output is divided among reducers. By default, MapReduce computes a hash of the keys output by the mappers and uses the hash values to distribute the key-value pairs evenly among the available reducers. Say we want the data for each value of a column to be processed together; we can use DISTRIBUTE BY to ensure that the records for each value go to the same reducer. DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing. Note that Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause when both appear in the same query.

For example, consider the following query without SORT BY:


Select t3.id, t3.name, t3.salary, t3.off_location from t3 distribute by t3.off_location;

Now, consider the same query with SORT BY added:

Select t3.id, t3.name, t3.salary, t3.off_location from t3 distribute by t3.off_location sort by t3.salary desc;
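
As a related note (an addition to the original post): when the DISTRIBUTE BY and SORT BY columns are the same and ascending order is acceptable, Hive provides CLUSTER BY as a shorthand for the combination, for example:

Select t3.id, t3.name, t3.salary, t3.off_location from t3 cluster by t3.off_location;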


Friday, 14 August 2015

Joins In Hive


Creating the tables:-

1.      create table t1 (id int, name string, age int) row format delimited fields terminated by '\,';
2.      create table t2 (id int, name string, age int) row format delimited fields terminated by '\,';
3.      create table t3 (id int, name string, age int, designation string, off_location string, salary double) row format delimited fields terminated by '\,';

Loading data into tables t1, t2 and t3:-

1.      load data local inpath '/home/training/Desktop/t1.txt' into table t1;
2.      load data local inpath '/home/training/Desktop/t2.txt' into table t2;
3.      load data local inpath '/home/training/Desktop/t3.txt' into table t3;

Retrieving all the rows from the tables:-

select * from t1;

select * from t2;

select * from t3;

Simple join:-

Hive supports only equality joins, because non-equality joins are difficult to express in MapReduce.
Syntax: select table_name_1.ColumnValue, table_name_2.ColumnValue from table_name_1 JOIN table_name_2 on (table_name_1.ColumnValue=table_name_2.ColumnValue)

select t1.id,t1.name,t1.age,t2.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);

select t1.id,t1.name,t2.age from t1 JOIN t2 ON(t1.id=t2.id);

Select t3.id,t3.designation,t2.name,t1.age from t3 join t2 on(t3.id=t2.id) join t1 on(t3.id=t1.id);

Inner join:

Inner joins are commutative; however, should the user prefer having t1 as the left table and t2 as the right table, or vice versa? The answer lies in understanding how Hive compiles a join query to MapReduce. In the MapReduce job for a regular inner join, mappers run on both tables, emitting the records that need to be joined after evaluating any UDFs in the query and filtering out records based on the WHERE clause. The shuffle phase then distributes the rows by the join key (id in the example above). Subsequently, in the reduce phase, essentially a cross-product takes place between the records from each table that share the same join key. During this reduce phase, records from all tables except the last one are buffered in memory, while the records from the query's last table are streamed into the reducer. Consequently, to reduce the memory required by the reducers, it is recommended to put the largest table last so that its records are streamed rather than buffered.

Select t2.name, t3.designation, t3.off_location from t2 JOIN t3 where t2.id=t3.id AND t2.name=t3.name;
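
As a side note (not from the original post), if the largest table cannot conveniently be written last, Hive also offers the STREAMTABLE hint to tell the planner which table to stream through the reducers, for example:

Select /*+ STREAMTABLE(t3) */ t2.name, t3.designation, t3.off_location from t2 JOIN t3 where t2.id=t3.id AND t2.name=t3.name;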

Outer Join:
In an outer join, the preserved-row table is the table that must return all of its rows. For left outer joins this is the left table, for right outer joins it is the right table, and for full outer joins both tables are preserved-row tables.
Left outer join:-
Select t3.off_location, t3.id, t1.id from t3 left outer join t1 on t3.id=t1.id;

Right outer join:-
Select t3.off_location, t3.id, t1.id from t3 right outer join t1 on t3.id=t1.id;

Full outer join:-
Select t3.off_location, t3.id, t1.id from t3 full outer join t1 on t3.id=t1.id;

Lateral view explode() in Hive


Lateral view is used in conjunction with user-defined table-generating functions (UDTFs) such as explode(). A UDTF generates zero or more output rows for each input row. A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.

Input file:-

Column1          column2
One                 [1, 2, 3]
Two                 [3, 4, 5]

Create table lateral1 (column1 string, column2 array<int>) row format delimited collection items terminated by ',' stored as textfile;

Now explode the list of elements in the array using the lateral view explode() function:-

SELECT column1, col
FROM lateral1 LATERAL VIEW explode(column2) lateralTable AS col;

The output result will be:-

One     1
One     2
One     3
Two     3
Two     4
Two     5
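
As a closing addition (not in the original post): if the position of each element is also needed, which ties back to the earlier point about extracting the ordering of elements in an array, the posexplode() UDTF returns a zero-based index along with each element:

SELECT column1, pos, val
FROM lateral1 LATERAL VIEW posexplode(column2) lateralTable AS pos, val;

This would return each element with its index, e.g. One 0 1, One 1 2, and so on.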