BookmarkSubscribeRSS Feed
🔒 This topic is locked. We are no longer accepting replies to this topic. Need further help? Please sign in and ask a new question.
odesh
Quartz | Level 8

Hello,

Two (2) questions about the general HIVE QL  SELECT statement ( please see attachment):

 

1. Why SELECT ALL ?  Should this not be SELECT * ...  ?

 

2. Is the usual ORDER BY  now accounted for by A COMBINATION of 3 HIVE QL constructs ?

     (CLUSTER BY, DISTRIBUTE BY AND SORT BY ) ?

   Parsimony ?

 

Thanks.

Odesh.

1 REPLY 1
JBailey
Barite | Level 11

Hi @odesh 

 

SELECT ALL - is the default. It means to return all rows. The alternative is specify DISTINCT (SELECT DISTINCT ...) which removes duplicates. You seldom encounter the ALL keyword. 

 

SELECT ALL * --includes duplicate rows

   FROM mytable;

SELECT DISTINCT * -- removes duplicate rows

   FROM mytable

 

The syntax on the slide shows the Common Table Expression (CTE) statement. Since Hive, and Hadoop, are weird, you have more choices. 

 

ORDER BY - if hive.mapred.mode=strict must include a LIMIT clause. If it doesn't, you will get an error. 

 

SORT BY - similar to ORDER BY,  sorts the rows before feeding it to the MapReduce reducers.

 

SORT BY vs. ORDER BY

 

CLUSTER BY and DISTRIBUTE BYMay as well read this in the doc.

 

As with all things Hive/Hadoop, there is nothing like practicing and suffering (unfortunately).

 

Best wishes,

Jeff