Tuesday 4 October 2016

Hive User Defined Functions (Hive UDF’s)

Hive streaming, often loosely referred to as a Hive UDF, is used when Hive's built-in functions are insufficient. For example, extracting the ordering of elements in an array would require writing a custom Java UDF, which takes considerable time. It is often easier to write a script in another language such as Python or Perl and feed it into a Hive query using the TRANSFORM clause.

Input data:

userid    movieid    rating    unixtime
6         86         3         883603013
7         32         4         891350932
8         338        4         879361873

Now create a Hive table and load the input data into it as shown below:

hive> CREATE TABLE udf_data (userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

hive> LOAD DATA LOCAL INPATH '/home/training/Desktop/udf1.txt' INTO TABLE udf_data;
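
As a quick optional check, a few rows can be listed to confirm the load before running the streaming job:

hive> SELECT * FROM udf_data LIMIT 3;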

Python UDF (udf1.py):

import sys
import datetime

# Hive streams each input row to stdin as one tab-delimited line.
for line in sys.stdin:
    userid, movieid, rating, unixtime = line.strip().split('\t')
    # isoweekday(): Monday = 1 ... Sunday = 7
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Emit the transformed row, tab-delimited, on stdout for Hive to collect.
    print('\t'.join([userid, movieid, rating, str(weekday)]))
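
Since the script simply reads tab-delimited lines from stdin and writes to stdout, it can be tested locally before involving Hive, for example:

cat /home/training/Desktop/udf1.txt | python udf1.py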

Add the Python file as a resource to the Hive session, so that Hive ships it to the cluster along with the query (note that ADD FILE registers a session resource; it is not an upload to HDFS):
hive> ADD FILE udf1.py;
Resource added.
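
If needed, the registered resources can be confirmed with:

hive> LIST FILES;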

The following query streams each row through the Python script above and returns the weekday derived from unixtime:
hive> SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python udf1.py'
AS (userid, movieid, rating, weekday)
FROM udf_data;

Output of the streaming query:

6         86        3          3
7         32        4          2
8         338      4          3

Time taken: 20.894 seconds
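
To keep the transformed rows instead of only printing them, the same TRANSFORM can feed an INSERT into a second table. A minimal sketch, where weekday_data is a hypothetical table name:

hive> CREATE TABLE weekday_data (userid INT, movieid INT, rating INT, weekday INT);
hive> INSERT OVERWRITE TABLE weekday_data
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python udf1.py'
AS (userid, movieid, rating, weekday)
FROM udf_data;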

Thursday 25 August 2016

SQOOP Import to HBase

Besides HDFS and Hive, Sqoop can also import data into other targets such as HBase. The following arguments are used to import data from an RDBMS into HBase; a complete example follows the list.

--hbase-table <table_name>
                Instructs Sqoop to import the data into the specified HBase table instead of a directory in HDFS. The table must already exist in HBase, or it can be created during the import by also passing --hbase-create-table.

--hbase-row-key <column>
                Specifies which column of the RDBMS table becomes the row key in HBase.

--column-family <family>
                Sets the HBase column family into which the imported columns are written. This argument is required for an HBase import.

--hbase-create-table
                Creates the target table in HBase if it does not already exist.

The following example imports the incremental_mysql table from MySQL into a new HBase table named hbase_newimport, using movieid as the row key:

sqoop import \
--connect jdbc:mysql://localhost/cts571687?user=root \
--table incremental_mysql \
--hbase-table hbase_newimport \
--hbase-row-key movieid \
--column-family metadata \
--hbase-create-table \
-m 1
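
Once the import completes, the rows can be verified from the HBase shell, e.g.:

hbase shell
hbase> scan 'hbase_newimport'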