Solved: Re: Big Data - Module 2 Chapter 4 - Pig Latin

odesh · Posted 09-21-2019 04:46 PM

Hello,

Please refer to the attachment. This is part of the solution to the last exercise in Chapter 4. There are 5 lines of Pig Latin code, and I am not sure that I understand the logic correctly and completely. I am writing down what I think is being done. Please tell me where I am correct and where I am not.

line 1: a = Loads the mobydick text file into the DIHPS folder (in Hive?).

line 2: b = For each row in "a" above (that is, in the mobydick text file), put each word on its own physical line.

Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?

line 3: c = Are we grouping each row in step a by "word"? Can you please give an example here to clarify?

line 4: d = Are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.

line 5: We store the result as a 2-column table called pig_wordcount in the DIHPS back on the SAS server.

Thanks.

Odesh.

Cynthia_sas · Posted 09-23-2019 11:52 AM

Hi:

Here's feedback from the instructors:

line 1: a = Loads the mobydick text file into the DIHPS folder (in Hive?).

– The LOAD operator in the Pig script reads the data directly from HDFS. Pig has no interaction with Hive here.

line 2: b = For each row in "a" above (that is, in the mobydick text file), put each word on its own physical line.

Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?

– The FLATTEN operator is required; TOKENIZE alone would not be enough. TOKENIZE splits each line into a bag of word tuples, but that bag still sits on the same row as the line it came from. FLATTEN un-nests the bag so that each word tuple becomes its own row. Once each row holds a single word, we can group on the words to get a word count.
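To make the difference concrete, here is a minimal sketch in plain Python that imitates what the two operators do. It is not Pig code and not the course solution. It assumes words are split on whitespace only (Pig's TOKENIZE also splits on a few punctuation characters), and the function names `tokenize` and `flatten` and the sample lines are made up for illustration.

```python
# Illustration only: plain Python imitating Pig's TOKENIZE and FLATTEN.
# Assumption: words are split on whitespace only.

def tokenize(line):
    """Return a 'bag' (a list) of one-field tuples, one per word."""
    return [(word,) for word in line.split()]

def flatten(rows):
    """Un-nest each row's bag so every word tuple becomes its own row."""
    return [word_tuple for bag in rows for word_tuple in bag]

lines = ["call me ishmael", "the whale the sea"]  # made-up sample input

# TOKENIZE alone: still one row per input line, each holding a bag of words.
tokenized = [tokenize(line) for line in lines]

# FLATTEN applied to the TOKENIZE result: one row per word, ready for grouping.
flattened = flatten(tokenized)

print(len(tokenized))  # 2 rows, one per input line
print(len(flattened))  # 7 rows, one per word
```

With TOKENIZE alone, a later group-by would be grouping whole bags rather than individual words, which is why the un-nesting step matters.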

Here's a way to create 2 different variables and see the impact of using FLATTEN:

line 3: c = Are we grouping each row in step a by "word"? Can you please give an example here to clarify?

– yes, it’s very similar to a SQL GROUP BY statement.
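As a small example of what the grouping produces, here is a hedged sketch in plain Python rather than Pig. In Pig, a GROUP yields one row per distinct key, holding the key and a bag of all the original rows that share it; the sketch below mimics that shape. The sample rows and the variable names are invented for illustration.

```python
from collections import defaultdict

# Illustration only: the shape of a group-by-word result, in plain Python.
words = [("the",), ("whale",), ("the",), ("sea",)]  # made-up one-word rows

groups = defaultdict(list)
for row in words:
    # Key is the word itself; value is the bag of rows carrying that word.
    groups[row[0]].append(row)

# Sort by key only so the printed order is stable.
grouped = sorted(groups.items())
for key, bag in grouped:
    print(key, bag)
```

Unlike a SQL GROUP BY, nothing has been aggregated yet at this point: each group still carries every matching row, and the aggregation happens in a separate step.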

line 4: d = Are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.

– yes.
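The counting step amounts to taking the size of each group's bag. A minimal Python sketch of that idea, using made-up groups rather than the real Moby Dick data:

```python
# Illustration only: counting the rows in each group, in plain Python.
grouped = {
    "sea": [("sea",)],
    "the": [("the",), ("the",)],
    "whale": [("whale",)],
}  # made-up groups, keyed by word

# One (word, count) pair per group; the count is the size of the bag.
counts = [(word, len(bag)) for word, bag in sorted(grouped.items())]
print(counts)
```

The result has two columns, the word and its count, which matches the 2-column table described for line 5.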

line 5: We store the result as a 2-column table called pig_wordcount in the DIHPS back on the SAS server.

– yes. However, this table is small compared with the original file and the temporary tables used to perform the word count. When possible, keep the results on the Hadoop cluster, and move them to SAS only when they are needed for additional SAS processing.

Hope this helps,

Cynthia

odesh · Posted 09-23-2019 01:27 PM

Very helpful, but one question.

After the line "Here's a way to create 2 different variables and see the impact of using FLATTEN:", was there some additional information that was going to be presented at that point?

Odesh.

Cynthia_sas · Posted 09-23-2019 01:55 PM

Yes, there was a screen shot (I can see it in the original post but I am re-posting it here):

Note the difference between using TOKENIZE (wordb1) and using FLATTEN and TOKENIZE (wordb2).

Cynthia

odesh · Posted 09-23-2019 04:07 PM

Excellent answer.

Thanks very much.
Odesh.
