topic Re: Big Data - Module 2 Chapter 4 - Pig Latin in SAS Academy for Data Science

Big Data - Module 2 Chapter 4 - Pig Latin

odesh — Sat, 21 Sep 2019 20:46:27 GMT

Hello,

Please refer to the attachment. This is part of the solution to the last exercise in Chapter 4. There are 5 lines of Pig Latin code, and I am not sure that I understand the logic correctly and completely. I am writing down what I think is being done. Please tell me where I am correct and where I am not.

line 1: a = Loads the mobydick text file into the DIHPS folder (in Hive?).

line 2: b = For each row in "a" above (that is, in the mobydick text file), put each word on its own consecutive physical line.

Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?

line 3: c = Are we grouping each row in Step a by "word"? Can you please give an example here to clarify?

line 4: d = Are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.

line 5: We store a table with 2 columns, called pig_wordcount, in the DIHPS back on the SAS server.

Thanks.

Odesh.

Re: Big Data - Module 2 Chapter 4 - Pig Latin

Cynthia_sas — Mon, 23 Sep 2019 15:52:39 GMT

Hi:

Here's feedback from the instructors:

line 1: a = Loads the mobydick text file into the DIHPS folder (in Hive?).

– The Pig LOAD operator reads the data directly from HDFS; Hive is not involved.

line 2: b = For each row in "a" above (that is, in the mobydick text file), put each word on its own consecutive physical line.

Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?

– FLATTEN is required; TOKENIZE alone would not be enough. For each input line, TOKENIZE returns a single bag holding one tuple per word, so all the words from that line still sit on the same row. FLATTEN un-nests the bag and puts each tuple on its own row. Once each row holds a single word, we can group on the words to get a word count.

Here's a way to create 2 different variables and see the impact of using FLATTEN:

line 3: c = Are we grouping each row in Step a by "word"? Can you please give an example here to clarify?

– yes, it’s very similar to a SQL GROUP BY statement.
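As a rough illustration of what grouping by word produces, here is a minimal sketch modelled in Python rather than Pig. The word list is made up, and the dictionary is only a stand-in for the grouped relation: in Pig, each output row of a GROUP carries the group key plus a bag of all the input rows that share that key.

```python
from collections import defaultdict

# Made-up sample: one word per row, as it would look after FLATTEN(TOKENIZE(...)).
words = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Model of grouping by word: key -> bag (list) of every row with that key.
groups = defaultdict(list)
for word in words:
    groups[word].append(word)

for key, bag in groups.items():
    print(key, bag)
# whale ['whale', 'whale', 'whale']
# sea ['sea', 'sea']
# ship ['ship']
```

No counting has happened yet at this point; each group simply collects the matching rows.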

line 4: d = Are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.

– yes.
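The counting step can be pictured the same way. This is a small Python model, not the exercise's Pig code: the `grouped` dictionary is made-up sample data standing in for the grouped relation, and the count for each word is just the size of its bag.

```python
# Made-up grouped data: word -> bag of rows that share that word.
grouped = {
    "whale": ["whale", "whale", "whale"],
    "sea": ["sea", "sea"],
    "ship": ["ship"],
}

# One output row per group: (word, number of rows in the bag).
counts = [(word, len(bag)) for word, bag in grouped.items()]

print(counts)  # [('whale', 3), ('sea', 2), ('ship', 1)]
```

The result has two columns, the word and its count, which matches the 2-column table described for line 5.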

line 5: We store a table with 2 columns, called pig_wordcount, in the DIHPS back on the SAS server.

– Yes. However, this table is small compared with the original file and the temporary tables used to perform the word count. When possible, keep the results on the Hadoop cluster unless they are absolutely needed for additional SAS processing.

Hope this helps,

Cynthia

Re: Big Data - Module 2 Chapter 4 - Pig Latin

odesh — Mon, 23 Sep 2019 17:27:32 GMT

Very helpful, but one question:

After the line "Here's a way to create 2 different variables and see the impact of using FLATTEN:", was there some additional information that was going to be presented at that point?

Odesh.

Re: Big Data - Module 2 Chapter 4 - Pig Latin

Cynthia_sas — Mon, 23 Sep 2019 17:55:13 GMT

Yes, there was a screenshot (I can see it in the original post, but I am re-posting it here):

Note the difference between using TOKENIZE (wordb1) and using FLATTEN and TOKENIZE (wordb2).
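A minimal sketch of that difference, modelled in Python rather than Pig and using two made-up input lines. The `tokenize` helper is a hypothetical stand-in that splits on whitespace only (Pig's TOKENIZE also splits on a few other delimiters); `wordb1` and `wordb2` reuse the names mentioned above.

```python
def tokenize(line):
    """Rough model of TOKENIZE: one bag (list) of single-field tuples per line."""
    return [(word,) for word in line.split()]

# Made-up input: each string is one row of the loaded text file.
lines = ["Call me Ishmael", "the white whale"]

# TOKENIZE only: still one row per input line, each row holding a bag of word tuples.
wordb1 = [tokenize(line) for line in lines]

# FLATTEN + TOKENIZE: the bag is un-nested, giving one row per word.
wordb2 = [word for line in lines for (word,) in tokenize(line)]

print(len(wordb1), wordb1)
# 2 [[('Call',), ('me',), ('Ishmael',)], [('the',), ('white',), ('whale',)]]
print(len(wordb2), wordb2)
# 6 ['Call', 'me', 'Ishmael', 'the', 'white', 'whale']
```

With `wordb1` there are still only two rows, so grouping by word would have nothing sensible to group on; with `wordb2` each of the six rows is a single word.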

Cynthia

Re: Big Data - Module 2 Chapter 4 - Pig Latin

odesh — Mon, 23 Sep 2019 20:07:04 GMT

Excellent answer.

Thanks very much.
Odesh.