Custom transformation scripts in Hive, also known as Hive streaming, are used when Hive's built-in functions are insufficient. For example, extracting the ordering of elements in an array would otherwise require writing a custom Java UDF, which takes a lot of time. It is much easier to write a script in another language such as Python or Perl and feed it into a Hive query using the TRANSFORM clause.
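TRANSFORM hands each input row to the script as one tab-separated line on stdin, and parses each tab-separated line the script writes to stdout back into output columns. A minimal sketch of that round trip (plain Python, no Hive involved; the sample row is taken from the data below):

```python
# Sketch of the TRANSFORM wire format: Hive serializes a row as one
# tab-separated line; the script emits tab-separated lines back.
row = ["6", "86", "3", "883603013"]        # a row of the Hive table
line_to_script = "\t".join(row)            # what the script reads on stdin
out_columns = line_to_script.split("\t")   # what Hive parses back into columns
print(out_columns)
```

This is why the script below simply splits on '\t', transforms the fields, and joins them with '\t' again.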
Input data:
userid  movieid  rating  unixtime
6       86       3       883603013
7       32       4       891350932
8       338     4       879361873
Now create a Hive table and load the input data into it as shown below:
hive> CREATE TABLE udf_data (userid INT, movieid INT, rating INT, unixtime STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/training/Desktop/udf1.txt' INTO TABLE udf_data;
Python UDF (udf1.py):

import sys
import datetime

for line in sys.stdin:
    # Each input row arrives as one tab-separated line on stdin
    userid, movieid, rating, unixtime = line.strip().split('\t')
    # isoweekday(): Monday = 1 ... Sunday = 7
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
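Because the script is just a stdin-to-stdout filter, its logic can be checked locally before running the Hive job. The sketch below wraps the same transformation in a helper function (the name transform_line is ours, not part of the script) and feeds it one sample row:

```python
import datetime
import io

def transform_line(line):
    # Same logic as udf1.py: replace the unixtime column with its ISO weekday
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# Feed one sample input row, as Hive would over stdin
for line in io.StringIO("6\t86\t3\t883603013\n"):
    print(transform_line(line))
```

Note that fromtimestamp() converts in the local timezone, so the weekday printed can differ by one day between machines in different timezones.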
Add the Python file as a session resource (Hive distributes it to the worker nodes; ADD FILE does not copy it into HDFS yourself):
hive> ADD FILE udf1.py;
Resource added.
The following query returns the weekday of unixtime by streaming each row through the Python script:
hive> SELECT TRANSFORM (userid, movieid, rating, unixtime)
      USING 'python udf1.py'
      AS (userid, movieid, rating, weekday)
      FROM udf_data;
Output of the UDF:
6    86    3    3
7    32    4    2
8    338  4    3
Time taken: 20.894 seconds