Hello,
Please refer to the attachment. This is part of the solution to the last exercise in Chapter 4. There are 5 lines of Pig Latin code, and I am not sure that I understand the logic correctly and completely. I have written down what I think each line does. Please tell me where I am correct and where I am not.
line 1: a = Loads the mobydick text file into the DIHPS folder (in Hive?).
line 2: b = For each row in "a" above (that is, in the mobydick text file), put each word on its own consecutive physical line.
Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?
line 3: c = Are we grouping each row in Step a by "word"? Can you please give an example here to clarify?
line 4: d = Are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.
line 5: We store a two-column table called pig_wordcount in the DIHPS, back on the SAS server.
Thanks.
Odesh.
Yes, there was a screenshot (I can see it in the original post, but I am re-posting it here):
Note the difference between using TOKENIZE (wordb1) and using FLATTEN and TOKENIZE (wordb2).
Cynthia
Hi:
Here's feedback from the instructors:
line 1: a = Loads the mobydick text file into the DIHPS folder (in Hive?).
– The LOAD operator in the Pig script reads the data directly from HDFS. Hive is not involved in this script.
line 2: b = For each row in "a" above (that is, in the mobydick text file), put each word on its own consecutive physical line.
Why do we need a FLATTEN statement here? Would the TOKENIZE statement not be enough?
– FLATTEN is required; TOKENIZE alone would not be enough. TOKENIZE splits each line into a bag of single-word tuples, so all the words from one line still sit together in one row. FLATTEN un-nests that bag and puts each tuple on a separate row. Once each row represents one word, we can group on the words to get a word count.
Here's a way to create 2 different variables and see the impact of using FLATTEN:
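The screenshot is not reproduced in this text, so what follows is only a minimal sketch of such a comparison, not the instructors' exact code. The relation names wordb1 and wordb2 come from the note in this thread; the HDFS path and the use of TextLoader are assumptions for illustration.

```pig
-- Hypothetical HDFS path; substitute the location used in your environment.
-- TextLoader reads each line of the file as a single chararray field.
a = LOAD '/user/student/mobydick.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE alone: still one row per input line, holding a bag of word tuples.
wordb1 = FOREACH a GENERATE TOKENIZE(line) AS words;

-- FLATTEN(TOKENIZE(...)): the bag is un-nested, giving one row per word.
wordb2 = FOREACH a GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Print the schema of each relation to compare their shapes.
DESCRIBE wordb1;
DESCRIBE wordb2;

-- For an input line such as: the whale the sea
--   wordb1 holds one row with a bag such as: ({(the),(whale),(the),(sea)})
--   wordb2 holds four rows: (the) (whale) (the) (sea)
```

Running DUMP on each relation for a small input file is another way to see the difference.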
line 3: c = Are we grouping each row in Step a by "word"? Can you please give an example here to clarify?
– yes, it’s very similar to a SQL GROUP BY statement.
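As a small example of the grouping step, here is a sketch that assumes the flattened relation is called b and has a single field named word, as in the standard word-count pattern; the path is hypothetical. One difference from SQL is that Pig's GROUP does not aggregate by itself: it collects the matching rows into a bag, and a later FOREACH applies the aggregate.

```pig
-- Hypothetical HDFS path; substitute the location used in your environment.
a = LOAD '/user/student/mobydick.txt' USING TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- One output row per distinct word. The first field is named "group" and
-- holds the word; the second field is a bag named "b" holding every
-- matching row from relation b.
c = GROUP b BY word;

-- If b held the rows (the) (whale) (the) (sea), c would hold
-- (row order is not guaranteed):
--   (sea,{(sea)})
--   (the,{(the),(the)})
--   (whale,{(whale)})
```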
line 4: d = Are we counting the number of occurrences of a given word? For example, a word like "whale" might have been used about 4000 times in the book.
– yes.
line 5: We store a two-column table called pig_wordcount in the DIHPS, back on the SAS server.
– Yes. This result table is small compared with the original file and the temporary tables used to perform the word count. In general, though, keep results on the Hadoop cluster unless they are needed for additional SAS processing.
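Putting the five steps together, a word-count script of this kind could look like the sketch below. The attachment is not reproduced in this thread, so this follows the standard Pig word-count pattern, not necessarily the exact course code. Both paths, the field name occurrences, and the comma delimiter are assumptions, and this version writes its result to HDFS.

```pig
-- line 1: read the text file from HDFS, one line per row (hypothetical path).
a = LOAD '/user/student/mobydick.txt' USING TextLoader() AS (line:chararray);

-- line 2: split each line into words and un-nest them, one word per row.
b = FOREACH a GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- line 3: collect the rows for each distinct word into a bag.
c = GROUP b BY word;

-- line 4: count the rows in each bag, giving two columns: word and count.
d = FOREACH c GENERATE group AS word, COUNT(b) AS occurrences;

-- line 5: write the result to an HDFS directory (hypothetical path),
-- with the two columns separated by commas.
STORE d INTO '/user/student/pig_wordcount' USING PigStorage(',');
```

STORE writes a directory of part files, not a single file, and the output directory must not already exist when the script runs.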
Hope this helps,
Cynthia