odesh

I am looking for a general but comprehensive list of differences between Hive and Pig. So far, my limited understanding is as follows:

 

1. Hive (and HiveQL) is similar to RDBMSs like Oracle, Teradata, SQL Server, etc. (data types, database operations, table operations, etc.).

 

2. I am not sure whether the concept of a schema in Hive (or HiveQL) is different from that of the usual RDBMS. Is it, and if so, how?

Is there a schema concept in Pig and the related Pig Latin language?

 

3. HiveQL, in my mind, sits on top of MapReduce, and all HiveQL queries are compiled and optimized into one or more MapReduce programs before they are run on the Hadoop cluster.

 

4. What happens in a map as opposed to a reduce? What operations would be translated into a map, a reduce, or both?

 

5. Why was Java chosen as the language for writing the code that would finally be run on the Hadoop cluster?

 

6. Pig is even more high level than HiveQL and is useful for applications that can be structured as a data flow. What would be real-life examples of applications that would be suitable for Pig (and Grunt) as opposed to HiveQL and Beeline? What would be examples where Pig would not be a good choice?

 

Please tell me  where I understand things correctly and where I do not.

 

Thanks.

Odesh.

 

1 ACCEPTED SOLUTION

Cynthia_sas

Hi:

  Here's some feedback from the instructors about your questions.

Cynthia

 

1) Hive (and HiveQL) is similar to RDBMSs like Oracle, Teradata, SQL Server, etc. (data types, database operations, table operations, etc.).

Answer: Yes, except that HiveQL is not fully ANSI SQL compliant.

 

2) I am not sure whether the concept of a schema in Hive (or HiveQL) is different from that of the usual RDBMS. Is it, and if so, how? Is there a schema concept in Pig and the related Pig Latin language?

Answer: Hive does not have a schema concept in the strict RDBMS sense: the table definition is metadata that is applied to the files when they are read, not enforced when the data is loaded. Users can define relationships between tables, such as those found in a star or snowflake schema, using primary and foreign keys, although Hive generally treats such keys as informational rather than enforced. Again, HiveQL is not ANSI compliant, so not all SQL functions may be available.

For Pig Latin, the “schema” is the set of fields and their data types in an alias or relation.
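To make the idea of a schema as "fields and their data types in a relation" a bit more concrete, here is a minimal Python sketch of applying a schema to raw delimited lines at read time. It is only an illustration, not how Hive or Pig are implemented: the `SCHEMA` list, the `apply_schema` helper, and the sample lines are all hypothetical. Values that fail to convert become `None`, loosely mirroring the way these tools tend to return nulls for malformed fields instead of rejecting the data on load.

```python
# Hypothetical schema: (field name, converter) pairs, similar in spirit to
# a Pig "AS (name:chararray, age:int, city:chararray)" clause.
SCHEMA = [("name", str), ("age", int), ("city", str)]


def apply_schema(line, schema=SCHEMA, delimiter="\t"):
    """Interpret one raw text line using the schema, at read time."""
    values = line.rstrip("\n").split(delimiter)
    record = {}
    for (field, convert), raw in zip(schema, values):
        try:
            record[field] = convert(raw)
        except ValueError:
            # The stored data is left untouched; the bad field is read as null.
            record[field] = None
    return record


# Made-up sample data: the second line has a non-numeric age.
raw_lines = ["alice\t34\tOslo", "bob\tunknown\tLima"]
records = [apply_schema(line) for line in raw_lines]
print(records)
```

The point of the sketch is that the files themselves carry no types; the structure exists only in the definition that is applied when the data is read.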

 

3) HiveQL, in my mind, sits on top of MapReduce, and all HiveQL queries are compiled and optimized into one or more MapReduce programs before they are run on the Hadoop cluster.

Answer: That’s correct. The same is true for Pig.

 

4) What happens in a map as opposed to a reduce? What operations would be translated into a map, a reduce, or both?

Answer: A map is basically a record-by-record operation, such as “copying” from A to B, while a reduce is an aggregation or summation operation. For instance, using Sqoop to move data from a database to Hadoop uses only a map operation, much like a HiveQL SELECT * statement. However, a HiveQL GROUP BY statement would result in a reduce phase to summarize the results of the query.
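As a rough illustration of the difference, the following Python sketch imitates the two phases in memory for something like `SELECT region, SUM(amount) ... GROUP BY region`. It is a toy model, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` and the `sales` data are made up for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Handle each record on its own and emit a (key, value) pair."""
    for region, amount in records:
        yield region, amount


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate all values that share a key."""
    return {key: sum(values) for key, values in groups.items()}


sales = [("east", 10), ("west", 5), ("east", 7)]

# A plain copy or "SELECT *" would stop after the map phase.
copied = list(map_phase(sales))

# A GROUP BY needs the reduce phase as well.
totals = reduce_phase(shuffle(map_phase(sales)))
print(totals)  # {'east': 17, 'west': 5}
```

A map-only job never needs to see more than one record at a time, which is why a straight copy has no reduce phase, whereas the sum per region can only be computed once all records for that region have been brought together.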

 

5) Why was Java chosen as the language for writing the code that would finally be run on the Hadoop cluster?

Answer: Probably because Java was the most popular and extensible language available to Doug Cutting and Mike Cafarella in 2005.

 

6) Pig is even more high level than HiveQL and is useful for applications that can be structured as a data flow. What would be real-life examples of applications that would be suitable for Pig (and Grunt) as opposed to HiveQL and Beeline? What would be examples where Pig would not be a good choice?

Answer: Basically, HiveQL allows you to impose a structure on HDFS data and write SQL to manipulate the data. Pig gives more direct access to the data and requires fewer resources to work with the data than Hive. If I were working with unstructured data like IoT or social media, I would use Pig and some existing user-defined functions, like PageRank analysis, to slice and dice the data. Also, Pig allows you to write UDFs in Java and Python. As we know, Python is very extensible.
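For a sense of what a Python UDF for Pig might look like, here is a small sketch that pulls hashtags out of free-form social media text. Treat the details as assumptions: the function name `extract_hashtags` and the schema string are invented for this example, and the import assumes Pig's `pig_util` helper module, with a stand-in decorator so the file also runs as ordinary Python outside Pig.

```python
try:
    # Assumption: available when Pig runs the script as a Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the example also runs outside Pig; it only records the schema.
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("hashtags:chararray")
def extract_hashtags(text):
    """Return the lower-cased hashtags in a message as a comma-separated string."""
    if text is None:
        return None
    tags = [word.lower() for word in text.split()
            if word.startswith("#") and len(word) > 1]
    return ",".join(tags)


print(extract_hashtags("Loving the new #Hadoop cluster #BigData"))
# prints: #hadoop,#bigdata
```

In a Pig script, a file like this would typically be registered with a `REGISTER` statement and the function then called inside a `FOREACH ... GENERATE`; the exact registration syntax depends on the Pig version and on which Python engine is used.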
