Solved: Big Data - Module 2: Need more detail and clarity on Hive vs Pig.

odesh · Posted 08-23-2019 02:43 PM

I am looking for a general but comprehensive list of differences between Hive and Pig. So far, my limited understanding is as follows:

1. Hive (and HiveQL) are similar to RDBMSs like Oracle, Teradata, SQL Server, etc. (data types, database operations, table operations, etc.)

2. I am not sure whether the concept of schema in Hive (or HiveQL) is different from that of the usual RDBMS. Is it, and if so, how?

Is there a schema concept in Pig and the related Pig Latin language?

3. HiveQL, in my mind, sits on top of MapReduce, and all HiveQL queries are compiled and optimized into one or more MapReduce programs before these are run on the Hadoop cluster.

4. What happens in a Mapping as opposed to a Reduce? What operations would be translated into a Mapping, a Reduce, or both?

5. Why was Java chosen as the language for writing the code that would finally be run on the Hadoop cluster?

6. Pig is even more high level than HiveQL and is useful for applications that can be structured as a data flow. What would be real-life examples of applications that would be suitable for Pig (and Grunt) as opposed to HiveQL and Beeline? What would be examples where Pig would not be a good choice?

Please tell me where I understand things correctly and where I do not.

Thanks.

Odesh.

Cynthia_sas · Posted 08-26-2019 03:32 PM

Hi:

Here's some feedback from the instructors about your questions.

Cynthia

1) Hive (and HiveQL) are similar to RDBMSs like Oracle, Teradata, SQL Server, etc. (data types, database operations, table operations, etc.)

Answer: Yes, except that HiveQL is not ANSI compliant.

2) I am not sure whether the concept of schema in Hive (or HiveQL) is different from that of the usual RDBMS. Is it, and if so, how? Is there a schema concept in Pig and the related Pig Latin language?

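Answer: There is not a schema concept in the RDBMS sense. Hive applies the table definition when the data is read rather than enforcing it when the data is written. Users can define relationships between tables, such as those found in a star or snowflake schema, using primary and foreign keys. Again, HiveQL is not ANSI compliant, so not all SQL functions may be available.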

For Pig Latin, the concept of a “schema” is the fields and their data types in an alias or relation.
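To make the idea of a schema as "fields and their data types" concrete, here is a minimal Python sketch of applying a schema at read time. It is only an illustration, not Hive or Pig code: the names SENSOR_SCHEMA and read_with_schema and the sample data are made up, and treating an unconvertible value as None is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schema: ordered (field name, converter) pairs, similar in spirit
# to the field names and data types declared for a table or a relation.
Schema = List[Tuple[str, Callable[[str], object]]]

SENSOR_SCHEMA: Schema = [("device_id", str), ("reading", float), ("year", int)]


def read_with_schema(lines: List[str], schema: Schema, delimiter: str = ",") -> List[Dict[str, object]]:
    """Apply the schema while reading; the stored lines are never modified."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        row = {}
        for index, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(parts[index])
            except (IndexError, ValueError):
                # Assumption of this sketch: a missing or unconvertible value
                # is reported as None instead of rejecting the whole line.
                row[name] = None
        rows.append(row)
    return rows


# Made-up raw data; the second and third lines do not fully fit the schema.
raw_lines = ["d1,20.5,2019", "d2,not_a_number,2019", "d3,18.0"]
rows = read_with_schema(raw_lines, SENSOR_SCHEMA)
for row in rows:
    print(row)
```

The point of the sketch is that the raw lines stay as they are and the field names and types are only applied when the data is read.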

3) HiveQL, in my mind, sits on top of MapReduce, and all HiveQL queries are compiled and optimized into one or more MapReduce programs before these are run on the Hadoop cluster.

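Answer: That’s correct when MapReduce is the execution engine. The same is true for Pig. Both can also run on other execution engines, such as Apache Tez.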

4) What happens in a Mapping as opposed to a Reduce? What operations would be translated into a Mapping, a Reduce, or both?

Answer: A mapping is basically “copying” from A to B, one record at a time. A reduce is an aggregation or summation operation. For instance, using Sqoop to move data from a database to Hadoop uses only a mapping operation, just like a HiveQL SELECT * statement. However, a HiveQL GROUP BY statement would result in a reduce phase to summarize the results of the query.
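As a rough illustration of the difference, the following Python sketch imitates the two phases in memory. It is not Hadoop code: the function names map_phase, shuffle, and reduce_phase and the sales data are made up for this example, and a real job would run these steps distributed across the cluster.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper on every record independently and collect its output."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Bring all values with the same key together before the reduce phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Collapse the values of each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Made-up (region, amount) records.
sales = [("east", 10), ("west", 5), ("east", 7)]

# Map-only job, comparable to a plain copy or a SELECT *: no reduce is needed.
copied = map_phase(sales, lambda record: [record])

# Grouped aggregation, comparable to SELECT region, SUM(amount) ... GROUP BY region:
# the map emits (key, value) pairs and the reduce sums the values per key.
pairs = map_phase(sales, lambda record: [(record[0], record[1])])
totals = reduce_phase(shuffle(pairs), lambda key, values: sum(values))

print(copied)
print(totals)
```

The map-only case never needs to look at more than one record at a time, while the grouped case has to bring all records for a key together first, which is what the reduce phase is for.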

5) Why was Java chosen as the language for writing the code that would finally be run on the Hadoop cluster?

Answer: Probably because Java was the most popular and extensible language available to Doug Cutting and Mike Cafarella in 2005.

6) Pig is even more high level than HiveQL and is useful for applications that can be structured as a data flow. What would be real-life examples of applications that would be suitable for Pig (and Grunt) as opposed to HiveQL and Beeline? What would be examples where Pig would not be a good choice?

Answer: Basically, HiveQL allows you to impose a structure on HDFS data and write SQL to manipulate the data. Pig is direct access to the data, and requires fewer resources to work with data than Hive. If I were working with unstructured data like IoT or social media, I would use Pig and some existing user-defined functions, like page rank analysis, to slice and dice the data. Also, Pig allows you to write UDFs in Java and Python. As we know, Python is very extensible.
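To give a feel for the data-flow style on unstructured text, here is a small Python sketch. It is plain Python, not Pig Latin and not a registered Pig UDF: the function extract_hashtags and the sample posts are made up, and each commented step only loosely mirrors a step you would write in a Pig script.

```python
from collections import Counter


def extract_hashtags(text):
    """Stand-in for a user-defined function: pull hashtags out of free text."""
    return [
        word.strip(".,!?").lower()
        for word in text.split()
        if word.startswith("#")
    ]


# Step 1 (load): made-up social media posts with no fixed structure.
posts = [
    "Loving the new #Hadoop cluster!",
    "no tags here",
    "#hadoop and #Pig make a good pair.",
]

# Step 2 (filter): keep only the posts that contain a hashtag.
tagged = [post for post in posts if "#" in post]

# Step 3 (transform and flatten): apply the function to each remaining post.
hashtags = [tag for post in tagged for tag in extract_hashtags(post)]

# Step 4 (group and count): count how often each hashtag occurs.
counts = Counter(hashtags)

print(counts.most_common())
```

Each step takes the output of the previous one, which is the kind of processing that is awkward to express as a single SQL query over a table.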
