Big Data Module 2 - HIVE QL Syntax Select statement

odesh · Posted 08-25-2019 11:39 AM

Hello,

Two (2) questions about the general HIVE QL SELECT statement (please see attachment):

1. Why SELECT ALL? Should this not be SELECT * ...?

2. Is the usual ORDER BY now accounted for by a combination of 3 HIVE QL constructs (CLUSTER BY, DISTRIBUTE BY and SORT BY)?

Parsimony ?

Thanks.

Odesh.

JBailey · Posted 08-26-2019 02:33 PM

Hi @odesh

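SELECT ALL is the default. It means return all rows, duplicates included. The alternative is to specify DISTINCT (SELECT DISTINCT ...), which removes duplicates. ALL is not a replacement for *: ALL/DISTINCT controls duplicate rows, while * selects all columns, so the two can appear together. Because ALL is the default, you seldom see the keyword written out.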

SELECT ALL *       -- includes duplicate rows
   FROM mytable;

SELECT DISTINCT *  -- removes duplicate rows
   FROM mytable;
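
To try the ALL vs. DISTINCT difference without a Hive cluster, here is a minimal sketch in Python. It uses SQLite (through the standard sqlite3 module) as a stand-in, not Hive; SQLite happens to accept the same ALL and DISTINCT keywords. The table contents and the helper name run_query are made up for illustration.

```python
import sqlite3

def run_query(query):
    """Run a query against a small in-memory table that contains one duplicate row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
    # Hypothetical sample data: the row (1, 'a') appears twice.
    conn.executemany(
        "INSERT INTO mytable VALUES (?, ?)",
        [(1, "a"), (1, "a"), (2, "b")],
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

all_rows = run_query("SELECT ALL * FROM mytable")            # duplicates kept
plain_rows = run_query("SELECT * FROM mytable")              # same thing, ALL is implied
distinct_rows = run_query("SELECT DISTINCT * FROM mytable")  # duplicates removed

print(len(all_rows), len(plain_rows), len(distinct_rows))  # 3 3 2
```

SELECT ALL * and SELECT * return the same three rows; only DISTINCT collapses the duplicate.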

The syntax on the slide shows the Common Table Expression (CTE) statement. Because Hive and Hadoop are weird, you have more choices for ordering rows than the usual ORDER BY, which still exists.

ORDER BY - gives a total ordering of the result, which requires a single reducer. If hive.mapred.mode=strict, ORDER BY must be followed by a LIMIT clause. If it isn't, you will get an error.

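SORT BY - similar to ORDER BY, but it sorts the rows before feeding them to each MapReduce reducer, so the ordering holds only within a reducer. With more than one reducer, the overall output is only partially ordered.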

SORT BY vs. ORDER BY
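
The difference is easier to see with a toy model. The Python sketch below is not Hive and does not call Hive; it only imitates the idea that ORDER BY funnels every row through one reducer, while SORT BY lets each reducer sort its own share. The function names are hypothetical, and the round-robin split is an assumption made for the demo (without DISTRIBUTE BY, Hive does not promise any particular split).

```python
def order_by(rows, key):
    """Toy ORDER BY: one reducer sees every row, so the result is totally ordered."""
    return sorted(rows, key=key)

def sort_by(rows, key, num_reducers):
    """Toy SORT BY: each reducer sorts only the rows it received."""
    # Assumption: rows are dealt out round-robin; real Hive makes no such promise.
    shares = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(share, key=key) for share in shares]

rows = [5, 1, 4, 2, 3, 0]

total = order_by(rows, key=lambda r: r)
per_reducer = sort_by(rows, key=lambda r: r, num_reducers=2)
# Reading the reducer outputs one after another is what a client would see.
concatenated = [r for part in per_reducer for r in part]

print(total)         # [0, 1, 2, 3, 4, 5]
print(per_reducer)   # [[3, 4, 5], [0, 1, 2]]
print(concatenated)  # [3, 4, 5, 0, 1, 2]
```

Each reducer's output is sorted, but the combined result is not. With num_reducers=1 the two functions give the same ordering.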

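CLUSTER BY and DISTRIBUTE BY - DISTRIBUTE BY controls which reducer each row goes to (rows with the same key go to the same reducer), and CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same columns. For the details, you may as well read this in the doc.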
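
As a rough illustration of how these two relate, here is a toy Python model. It is not Hive: the function names are hypothetical, and assigning rows with key % num_reducers is an assumption standing in for whatever partitioning Hive actually uses. What it preserves is that rows with the same DISTRIBUTE BY key land on the same reducer, and that CLUSTER BY x behaves like DISTRIBUTE BY x SORT BY x.

```python
def distribute_by(rows, key, num_reducers):
    """Toy DISTRIBUTE BY: rows with the same key always reach the same reducer."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer keys, partitioned by simple modulo.
        parts[key(row) % num_reducers].append(row)
    return parts

def distribute_and_sort(rows, dist_key, sort_key, num_reducers):
    """Toy DISTRIBUTE BY ... SORT BY ...: partition first, then sort inside each reducer."""
    return [sorted(part, key=sort_key)
            for part in distribute_by(rows, dist_key, num_reducers)]

def cluster_by(rows, key, num_reducers):
    """Toy CLUSTER BY: distribute and sort on the same key."""
    return distribute_and_sort(rows, key, key, num_reducers)

rows = [(3, "c"), (1, "z"), (2, "b"), (1, "a"), (3, "a")]

# Like: SELECT ... CLUSTER BY id
clustered = cluster_by(rows, key=lambda r: r[0], num_reducers=2)

# Like: SELECT ... DISTRIBUTE BY id SORT BY id, name
dist_sorted = distribute_and_sort(
    rows, dist_key=lambda r: r[0], sort_key=lambda r: (r[0], r[1]), num_reducers=2
)

print(clustered)
print(dist_sorted)
```

In both results every row with id 1 or 3 sits in the same reducer's output and id 2 in the other; the DISTRIBUTE BY plus SORT BY form additionally orders by name within each id, which CLUSTER BY alone does not.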

As with all things Hive/Hadoop, there is nothing like practicing and suffering (unfortunately).

Best wishes,

Jeff
