BookmarkSubscribeRSS Feed
🔒 This topic is locked. We are no longer accepting replies to this topic. Need further help? Please sign in and ask a new question.
odesh
Quartz | Level 8

Hello,

Two (2) questions about the general HIVE QL  SELECT statement ( please see attachment):

 

1. Why SELECT ALL ?  Should this not be SELECT * ...  ?

 

2. Is the usual ORDER BY  now accounted for by A COMBINATION of 3 HIVE QL constructs ?

     (CLUSTER BY, DISTRIBUTE BY AND SORT BY ) ?

   Parsimony ?

 

Thanks.

Odesh.

1 REPLY 1
JBailey
Barite | Level 11

Hi @odesh 

 

SELECT ALL - is the default. It means to return all rows. The alternative is specify DISTINCT (SELECT DISTINCT ...) which removes duplicates. You seldom encounter the ALL keyword. 

 

SELECT ALL * --includes duplicate rows

   FROM mytable;

SELECT DISTINCT * -- removes duplicate rows

   FROM mytable

 

The syntax on the slide shows the Common Table Expression (CTE) statement. Since Hive, and Hadoop, are weird, you have more choices. 

 

ORDER BY - if hive.mapred.mode=strict must include a LIMIT clause. If it doesn't, you will get an error. 

 

SORT BY - similar to ORDER BY,  sorts the rows before feeding it to the MapReduce reducers.

 

SORT BY vs. ORDER BY

 

CLUSTER BY and DISTRIBUTE BYMay as well read this in the doc.

 

As with all things Hive/Hadoop, there is nothing like practicing and suffering (unfortunately).

 

Best wishes,

Jeff

 

This is a knowledge-sharing community for learners in the Academy. Find answers to your questions or post here for a reply.
To ensure your success, use these getting-started resources:

Estimating Your Study Time
Reserving Software Lab Time
Most Commonly Asked Questions
Troubleshooting Your SAS-Hadoop Training Environment

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 1 reply
  • 795 views
  • 0 likes
  • 2 in conversation