topic Re: Big Data Module 2 MapReducers in SAS Academy for Data Science

Big Data Module 2 MapReducers

odesh — Sat, 21 Sep 2019 23:36:23 GMT

Hello,

Please refer to the attached question.

I am not sure what the suggested answer means by "SORT BY provides reducer level sorting instead of job level sorting".

I know what ORDER BY means in the context of PROC SQL and DBMSs in general.

Thanks.

Odesh.

Re: Big Data Module 2 MapReducers

Patrick — Sun, 22 Sep 2019 01:39:46 GMT

This is not SAS but Hive SQL syntax, so you would need to ask such a question in a Hadoop/Hive forum. But after Googling a bit, here is what's documented.

Difference between Sort By and Order By
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

As far as I understand things, Hive SQL gets translated into MapReduce jobs for execution. It appears that Hive SORT BY and ORDER BY result in different MapReduce logic.
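
To make the difference concrete, here is a small conceptual sketch in plain Python. It is not Hive or Hadoop code: the function names and the round-robin distribution of rows to reducers are assumptions made purely for illustration, since how rows actually reach each reducer depends on the query and the engine.

```python
# Conceptual simulation only: plain Python, not Hive or Hadoop code.

def partition(rows, num_reducers):
    """Deal rows out to reducers round-robin (an assumed stand-in for
    however the engine actually distributes rows)."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)
    return buckets


def sort_by(rows, num_reducers):
    """Mimic SORT BY: each reducer sorts only its own rows, and the
    per-reducer outputs are simply concatenated."""
    output = []
    for bucket in partition(rows, num_reducers):
        output.extend(sorted(bucket))
    return output


def order_by(rows):
    """Mimic ORDER BY: one total sort over every row."""
    return sorted(rows)


rows = [7, 2, 9, 4, 1, 8]
sort_by_result = sort_by(rows, 2)
order_by_result = order_by(rows)

print(sort_by_result)   # [1, 7, 9, 2, 4, 8]  sorted within each reducer only
print(order_by_result)  # [1, 2, 4, 7, 8, 9]  sorted over the whole result
```

With a single reducer the two results coincide, which matches the documentation's wording that only "more than one reducer" leads to partially ordered output.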

Understanding the mapper and reducer in a HIVE database

Re: Big Data Module 2 MapReducers

odesh — Mon, 23 Sep 2019 17:32:32 GMT

Thanks Patrick. Yes, Google does access the general SAS documentation. I was hoping for a more narrowly focused answer.

But thanks again.
Odesh.

Re: Big Data Module 2 MapReducers

Patrick — Mon, 23 Sep 2019 20:59:14 GMT

@odesh

Not sure how the answer could be narrower. Hive Sort By results in a sort of rows within a reducer. If you've got more than one reducer then the data isn't sorted over the whole file but only within the chunks per reducer.

Re: Big Data Module 2 MapReducers

odesh — Tue, 24 Sep 2019 12:30:32 GMT

Just making sure that I understand the difference between ORDER BY and SORT BY in HiveQL:

1. ORDER BY sorts the entire result set (which can be very resource intensive with a large result set).
2. SORT BY sorts within each reducer, which should be more efficient in terms of processing time.

Am I correct?

Thanks.
Odesh.

Re: Big Data Module 2 MapReducers

Patrick — Tue, 24 Sep 2019 21:46:08 GMT

@odesh

Yes, that's how I understand what's explained under the links I've posted earlier.
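
As a rough illustration of the trade-off in those two points, the Python sketch below takes some made-up per-reducer outputs (each already sorted, as SORT BY would leave them) and shows that a total order only appears after one more global step. The helper `is_sorted` and the sample data are hypothetical, and the merge is just a way to picture the extra work, not a description of how Hive implements ORDER BY.

```python
import heapq


def is_sorted(values):
    """Return True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Made-up outputs of two reducers after a SORT BY: each chunk is sorted.
reducer_outputs = [[1, 7, 9], [2, 4, 8]]

# Reading the chunks one after another does not give a total order.
concatenated = [v for chunk in reducer_outputs for v in chunk]

# A global merge over all chunks does; this is the kind of extra,
# whole-result work that a total ordering has to pay for.
merged = list(heapq.merge(*reducer_outputs))

print(all(is_sorted(chunk) for chunk in reducer_outputs))  # True
print(is_sorted(concatenated))                             # False
print(merged)                                              # [1, 2, 4, 7, 8, 9]
```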