Tuesday 4 October 2016

Hive User Defined Functions (Hive UDFs)

A Hive UDF of this kind, also called Hive streaming, is used when Hive's built-in functions are insufficient. For example, to extract the ordering of elements in an array you would otherwise have to write a custom Java UDF, which takes a lot of time. It is much easier to write a script in another language such as Python or Perl and feed it into a Hive query using the TRANSFORM clause.

Input data (udf1.txt, tab-separated):-

userid    movieid    rating    unixtime
6         86         3         883603013
7         32         4         891350932
8         338        4         879361873
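If you want to recreate the sample file yourself, a minimal sketch such as the one below will do; it assumes the tab-separated layout above and the /home/training/Desktop/udf1.txt path used in the LOAD DATA statement further down (adjust the path for your own machine).

# Minimal sketch: write the three sample rows as tab-separated lines.
# The path matches the LOAD DATA LOCAL INPATH statement below; change it as needed.
rows = [
    ("6", "86", "3", "883603013"),
    ("7", "32", "4", "891350932"),
    ("8", "338", "4", "879361873"),
]

with open("/home/training/Desktop/udf1.txt", "w") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")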

Now create a Hive table and load the input data into it as shown below -

hive> CREATE TABLE udf_data (userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

hive> LOAD DATA LOCAL INPATH '/home/training/Desktop/udf1.txt' INTO TABLE udf_data;

Python UDF (udf1.py):-

import sys
import datetime

# Hive streams the selected columns to this script as tab-separated lines on stdin.
for line in sys.stdin:
    userid, movieid, rating, unixtime = line.strip().split('\t')
    # Convert the Unix timestamp to an ISO weekday (Monday = 1 ... Sunday = 7).
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Write the result back to Hive as a tab-separated line on stdout.
    print('\t'.join([userid, movieid, rating, str(weekday)]))
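Before wiring the script into Hive, it can be handy to check it locally by feeding it a few tab-separated rows on standard input, the same way TRANSFORM will. A quick sketch (assuming Python 3 and that udf1.py sits in the current working directory):

import subprocess

# Pipe sample rows into udf1.py exactly as Hive's TRANSFORM clause would,
# then print whatever the script writes back on stdout.
sample = "6\t86\t3\t883603013\n7\t32\t4\t891350932\n8\t338\t4\t879361873\n"
result = subprocess.run(
    ["python", "udf1.py"],   # assumes udf1.py is in the current directory
    input=sample,
    capture_output=True,
    text=True,
)
print(result.stdout, end="")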

Add the Python file as a resource to the Hive session, so it is shipped to the cluster along with the query:-
hive> ADD FILE udf1.py;
Resource added.

The following query returns the weekday for each unixtime by streaming the rows through the Python script above. TRANSFORM passes the selected columns to the script as tab-separated text on standard input and maps the script's tab-separated standard output to the columns named in the AS clause -
hive> SELECT TRANSFORM (userid, movieid, rating, unixtime) USING 'python udf1.py' AS (userid, movieid, rating, weekday) FROM udf_data;

Output of the UDF:-

userid    movieid    rating    weekday
6         86         3         3
7         32         4          2
8         338        4         3

Time taken: 20.894 seconds
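As a quick sanity check on the first row, the same conversion can be reproduced in a Python shell; note that fromtimestamp() uses the machine's local timezone, so a timestamp near midnight can shift by one day relative to the cluster that produced the output above.

import datetime

# Unix timestamp from the first sample row; isoweekday() == 3 (Wednesday)
# when evaluated in the same timezone as the Hive cluster above.
print(datetime.datetime.fromtimestamp(883603013.0).isoweekday())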