Let us dive deep into big data hadoop!: DISTRIBUTE BY and SORT BY clauses in Hive

Sunday, 23 August 2015

DISTRIBUTE BY and SORT BY clauses in Hive

DISTRIBUTE BY controls how map output is divided among reducers. By default, MapReduce computes a hash on the keys output by the mappers and uses the hash values to try to distribute the key-value pairs evenly among the available reducers. Say we want all the rows for each value of a column to be processed together. We can use DISTRIBUTE BY to ensure that the records for each value go to the same reducer. DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing, while SORT BY controls how the rows are sorted within each reducer. Note that Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause when both appear in the same query.
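
The hash-based routing described above can be pictured with a small Python simulation. This is only a sketch: `reducer_for`, `distribute_by`, the sample rows, and the use of CRC32 as the hash are all assumptions made for illustration, not Hive's actual implementation.

```python
import zlib
from collections import defaultdict


def reducer_for(key: str, num_reducers: int) -> int:
    # CRC32 is a stand-in for the real hash function (an assumption for this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_by(rows, column, num_reducers):
    # Route every row to the reducer chosen by hashing the DISTRIBUTE BY column.
    buckets = defaultdict(list)
    for row in rows:
        buckets[reducer_for(row[column], num_reducers)].append(row)
    return dict(buckets)


# Hypothetical sample rows; only the column being distributed on matters here.
rows = [
    {"id": 1, "off_location": "Pune"},
    {"id": 2, "off_location": "Delhi"},
    {"id": 3, "off_location": "Pune"},
    {"id": 4, "off_location": "Chennai"},
    {"id": 5, "off_location": "Delhi"},
]

buckets = distribute_by(rows, "off_location", 2)
for reducer_id, reducer_rows in sorted(buckets.items()):
    print(reducer_id, [r["id"] for r in reducer_rows])
```

Whatever hash is used, equal values of the column always map to the same reducer, which is the guarantee DISTRIBUTE BY gives. Several different values may still share one reducer.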

For example, consider the following query, which uses DISTRIBUTE BY without SORT BY:

SELECT t3.id, t3.name, t3.salary, t3.off_location FROM t3 DISTRIBUTE BY t3.off_location;

Now consider the same query with SORT BY added, which orders the rows inside each reducer by salary in descending order:

SELECT t3.id, t3.name, t3.salary, t3.off_location FROM t3 DISTRIBUTE BY t3.off_location SORT BY t3.salary DESC;
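
To see what the combined query does on the reducer side, here is a rough Python simulation. The function `run_query`, the sample rows, and the CRC32 hash are hypothetical stand-ins chosen for this sketch. It distributes rows by `off_location` and then sorts each reducer's rows by `salary` in descending order.

```python
import zlib
from collections import defaultdict


def run_query(rows, num_reducers):
    # DISTRIBUTE BY off_location: pick a reducer from a hash of the column value.
    # CRC32 stands in for the real hash function (an assumption for this sketch).
    reducers = defaultdict(list)
    for row in rows:
        rid = zlib.crc32(row["off_location"].encode("utf-8")) % num_reducers
        reducers[rid].append(row)
    # SORT BY salary DESC: each reducer sorts only the rows it received.
    return {
        rid: sorted(rs, key=lambda r: r["salary"], reverse=True)
        for rid, rs in reducers.items()
    }


# Hypothetical contents of table t3.
rows = [
    {"id": 1, "name": "A", "salary": 700, "off_location": "Pune"},
    {"id": 2, "name": "B", "salary": 800, "off_location": "Delhi"},
    {"id": 3, "name": "C", "salary": 900, "off_location": "Pune"},
    {"id": 4, "name": "D", "salary": 600, "off_location": "Delhi"},
]

# With a single reducer, both locations land together and are sorted by salary only.
for row in run_query(rows, 1)[0]:
    print(row["off_location"], row["salary"])
```

When one reducer receives more than one location, as in the single-reducer run above, the rows of different locations are interleaved by salary. If each location should stay contiguous in the output, the sort can include the distribution column first, for example SORT BY t3.off_location, t3.salary DESC.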
