odesh
Quartz | Level 8

Hello,

Please refer to the attachment. This is part of the solution to the last exercise in Chapter 4. There are 5 lines of Pig Latin code, and I am not sure that I understand the logic correctly and completely. I am writing down what I think is being done. Please tell me where I am correct and where I am not.

 

line 1: a = loads the mobydick text file into the DIHPS folder (in Hive?).

 

line 2: b = for each row in "a" above (that is, in the mobydick text file), put the words separately on consecutive physical lines.

Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?

 

line 3: c = are we grouping each row in step a by "word"? Can you please give an example here to clarify?

 

line 4: d = are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.

 

line 5: We store a 2-column table called pig_wordcount in the DIHPS back on the SAS server.

 

Thanks.

Odesh.

 

 

 

 

 

4 REPLIES
Cynthia_sas
SAS Super FREQ

Hi:

Here's feedback from the instructors:

 

line 1: a = loads the mobydick text file into the DIHPS folder (in Hive?).

– The Pig LOAD operator reads the data directly from HDFS. In this script, Pig has no interaction with Hive.

line 2: b = for each row in "a" above (that is, in the mobydick text file), put the words separately on consecutive physical lines.

Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?

– The FLATTEN operator is required; TOKENIZE alone would not be enough. TOKENIZE splits each line into a bag of tuples, one per word, but that bag still sits on a single row. FLATTEN takes the tuples that are on the same row and puts each tuple on a separate row. Once each row represents a word, we can group on the words to get a word count.

Here's a way to create 2 different variables and see the impact of using FLATTEN:

tokenize_flatten_compare.png
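
As a rough analogy only, the difference can also be sketched in Python. This is not Pig Latin, the helper names `tokenize` and `flatten` are made up for illustration, and the sample lines are invented. The sketch also splits on whitespace only, which is simpler than what Pig's TOKENIZE does.

```python
# Python analogy for Pig's TOKENIZE and FLATTEN; this is not Pig Latin.
# tokenize() and flatten() are hypothetical helpers written for this sketch.

def tokenize(line):
    """Split one line into a 'bag' (here a list) of one-field tuples.

    Assumption: whitespace-only splitting, to keep the sketch short.
    """
    return [(word,) for word in line.split()]


def flatten(bags):
    """Un-nest a sequence of bags so that each tuple becomes its own row."""
    return [tup for bag in bags for tup in bag]


# Invented sample input: two lines of text.
lines = ["call me ishmael", "the whale"]

# TOKENIZE only: still one row per input line, each row holding a bag of words.
tokenized_only = [tokenize(line) for line in lines]

# FLATTEN(TOKENIZE(...)): one row per word, which is what grouping needs.
flattened = flatten(tokenized_only)

print("rows after TOKENIZE only:", len(tokenized_only))
print("rows after FLATTEN:", len(flattened))
```

With TOKENIZE alone the row count matches the number of input lines; after FLATTEN it matches the number of words.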

 

 

line 3: c = are we grouping each row in step a by "word"? Can you please give an example here to clarify?

– yes, it’s very similar to a SQL GROUP BY statement.
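
Since an example was requested, here is a rough Python analogy (not Pig Latin; the sample words are invented for illustration). One difference from SQL worth noting: Pig's GROUP does not aggregate by itself. It produces one row per distinct key, and that row holds a bag of all the rows that share the key.

```python
# Python analogy for Pig's GROUP ... BY; this is not Pig Latin.
from collections import defaultdict

# Invented sample rows, one word per row (as they would look after FLATTEN).
words = ["the", "whale", "the", "sea", "whale", "the"]

# One entry per distinct word; each entry keeps every matching row.
groups = defaultdict(list)
for word in words:
    groups[word].append(word)

for key, bag in groups.items():
    print(key, bag)
```

Each key appears once, followed by the bag of rows that were grouped under it.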

line 4: d = are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.

– yes.
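
As a rough Python analogy (not Pig Latin; the sample words are invented), the count for a word is simply the number of rows in that word's group:

```python
# Python analogy for counting the rows in each word group; not Pig Latin.
from collections import Counter

# Invented sample rows, one word per row.
words = ["the", "whale", "the", "sea", "whale", "the"]

# Counter tallies how many rows carry each distinct word.
counts = Counter(words)

# Print the two-column result (word, count), most frequent first.
for word, n in counts.most_common():
    print(word, n)
```

The result has two columns, the word and its count, which matches the shape of the table stored in the last line of the script.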

line 5: We store a 2-column table called pig_wordcount in the DIHPS back on the SAS server.

– Yes. However, this table is small compared with the original file and the temporary tables used to perform the word count. When possible, keep results on the Hadoop cluster unless they are needed for additional SAS processing.

 

Hope this helps,

Cynthia

odesh
Quartz | Level 8
Very helpful, but one question.

After the line "Here's a way to create 2 different variables and see the impact of using FLATTEN:", was there some additional information that was going to be presented at that point?

Odesh.
Cynthia_sas
SAS Super FREQ

Yes, there was a screenshot (I can see it in the original post, but I am re-posting it here): tokenize_flatten_compare.png

 

Note the difference between using TOKENIZE alone (wordb1) and using FLATTEN with TOKENIZE (wordb2).

 

Cynthia

 

odesh
Quartz | Level 8
Excellent answer.

Thanks very much.
Odesh.

 
