odesh
Quartz | Level 8

Hello,

Please refer to the attached question. 

I am not sure what the suggested answer means by "SORT BY provides reducer level sorting instead of job level sorting".

 

I know what ORDER BY means in the context of PROC SQL and DBMSs in general.

 

Thanks.

Odesh.

 


5 REPLIES
Patrick
Opal | Level 21

This is not SAS but Hive SQL syntax, so you would need to ask such a question in a Hadoop/Hive forum. But after Googling a bit, here is what's documented:

 

Difference between Sort By and Order By
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

 

As far as I understand things, Hive SQL gets translated into MapReduce for execution. It appears that Hive SORT BY and ORDER BY result in different MapReduce logic.
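
To make the quoted difference concrete, here is a minimal Python sketch that only simulates the behaviour; it is not how Hive is implemented. The function names order_by and sort_by and the round-robin split of rows across reducers are assumptions made for illustration (Hive decides the real distribution itself).

```python
from typing import List


def order_by(rows: List[int]) -> List[int]:
    """Simulate ORDER BY: one sort over the whole result set (total order)."""
    return sorted(rows)


def sort_by(rows: List[int], num_reducers: int) -> List[List[int]]:
    """Simulate SORT BY: each reducer sorts only the rows it received."""
    # Assumption: rows are dealt out round-robin, purely for illustration.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket) for bucket in buckets]


rows = [5, 1, 4, 2, 3, 6]

total_order = order_by(rows)
per_reducer = sort_by(rows, num_reducers=2)
# Writing the reducer outputs one after another gives the "final" result.
concatenated = [value for chunk in per_reducer for value in chunk]

print(total_order)   # [1, 2, 3, 4, 5, 6]
print(per_reducer)   # [[3, 4, 5], [1, 2, 6]]
print(concatenated)  # [3, 4, 5, 1, 2, 6]
```

Each simulated reducer's output is sorted, but the concatenated result is only partially ordered. With num_reducers=1 the two functions give the same ordering.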

Understanding the mapper and reducer in a HIVE database 

odesh
Quartz | Level 8
Thanks Patrick. Yes, Google does access the general SAS documentation. I was
hoping for a more narrowly focused answer.

But thanks again.
Odesh.
Patrick
Opal | Level 21

@odesh 

Not sure how the answer could be narrower. Hive Sort By results in a sort of rows within a reducer. If you've got more than one reducer then the data isn't sorted over the whole file but only within the chunks per reducer. 
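
As a small illustration of "sorted within the chunks but not over the whole file", the Python sketch below checks both properties on made-up data. The values in reducer_chunks are hypothetical and only stand in for what two reducers might emit after a SORT BY on one numeric column.

```python
from typing import List, Sequence


def is_sorted(values: Sequence[int]) -> bool:
    """Return True if the values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical output of two reducers; each chunk is sorted on its own.
reducer_chunks: List[List[int]] = [[10, 40, 70], [20, 30, 90]]

each_chunk_sorted = all(is_sorted(chunk) for chunk in reducer_chunks)

# The "whole file" is the chunks written one after another.
whole_file = [value for chunk in reducer_chunks for value in chunk]
whole_file_sorted = is_sorted(whole_file)

print(each_chunk_sorted)  # True
print(whole_file_sorted)  # False
```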

odesh
Quartz | Level 8
Just making sure that I understand the difference between ORDER BY and SORT
BY in HiveQL:

1. ORDER BY sorts the entire result set (which can be very resource
intensive with a large result set)
2. SORT BY sorts within each reducer which should be more efficient in
terms of processing time.

Am I correct ?

Thanks.
Odesh.
Patrick
Opal | Level 21

@odesh wrote:
Just making sure that I understand the difference between ORDER BY and SORT
BY in HiveQL:

1. ORDER BY sorts the entire result set (which can be very resource
intensive with a large result set)
2. SORT BY sorts within each reducer which should be more efficient in
terms of processing time.

Am I correct ?

Thanks.
Odesh.

 

@odesh 

Yes, that's how I understand what's explained under the links I've posted earlier.