
SAS Studio Custom Task Tuesday: How to Perform Text Analysis

by SAS Employee OliviaWright on ‎02-21-2017 02:12 PM - edited on ‎03-24-2017 05:08 PM by Community Manager (749 Views)

This week, our Custom Task Tuesday post focuses on creating a task that performs a simple text analysis. If you’ve ever wondered how many times the word “Alice” is mentioned in the book “Alice in Wonderland,” then today is your lucky day!


The Text Analysis task performs a basic frequency analysis on text from Project Gutenberg. The user can enter the URL of any Project Gutenberg text or pick one of the built-in example texts. The task outputs a frequency analysis of the proper case words in the text (hopefully, important names and places).

 

This is what the task will look like when we are finished, along with sample output:

 

[Screenshot: Text Analysis.PNG]

 

Step 1: Getting Started

[Screenshot: new task.png]

In SAS Studio, under the Tasks and Utilities section, open a “New Task” as well as the “Sample Task.” We will copy and paste the necessary Velocity Template code from the Sample Task into our task.

 

Step 2: Naming and Saving the Text Analysis Task

 

Name: Text Analysis

Description: This task downloads the user’s specified text from Project Gutenberg, then analyzes the word count frequencies and outputs a table for the user.

At the top of the VTL code for your New Task, you will need to fill in the Name and Description portions as shown below:

 

[Screenshot: 1.png]

After you’ve done that, save the task to your My Tasks folder so you don’t lose it. Click the button (edity.png) in the upper left corner of the task to bring up this option screen:

 

[Screenshot: 2.png]

 

Step 3: Creating the Metadata Portion of the Text Analysis Task

 

Just like in previous blog posts, we will be stealing VTL code from the Sample Task.

 

From our “finished product,” you can see that we are going to use a combobox, a text box, and two checkboxes. Find the code that corresponds to those items in the Metadata section of the Sample Task, and copy and paste it into the same place in your task. Then edit the code you copied to match what we want in our finished product.

 

This is what your finished Metadata portion should look like (without the text for the labels, for the sake of brevity):

 

 <Metadata>
        <Options>
              <Option inputType="string" name="DATATAB">DATA</Option>
              <Option inputType="string" name="DATAGROUP">DATA</Option>
              <Option inputType="string" name="GROUPCHECK">REQUESTED OUTPUT</Option>
              <Option inputType="string" name="labelCHECK">”Label”</Option>
              <Option defaultValue="0" inputType="checkbox" name="chk1">List of all Proper Case Words in Descending Frequency Order</Option>
              <Option defaultValue="0" inputType="checkbox" name="chk2">List of all Capitalized Words in Descending Frequency Order, Not Including 100 Most Common Words</Option>
              <Option inputType="string" name="GROUPCOMBO">TEXT CHOICE</Option>
              <Option inputType="string" name="labelCOMBO">”Label”</Option>
              <Option inputType="string" name="labelCOMBO2">”Label”</Option>
              <Option defaultValue="http://www.gutenberg.org/cache/epub/35688/pg35688.txt" inputType="combobox" name="bookCHOICE">Examples:</Option>
              <Option inputType="string" name="http://www.gutenberg.org/cache/epub/35688/pg35688.txt">Alice in Wonderland</Option>
              <Option inputType="string" name="http://www.gutenberg.org/cache/epub/84/pg84.txt">Frankenstein</Option>
              <Option inputType="string" name="http://www.gutenberg.org/cache/epub/5200/pg5200.txt">Metamorphosis</Option>
              <Option inputType="string" name="http://www.gutenberg.org/cache/epub/768/pg768.txt">Wuthering Heights</Option>
              <Option inputType="string" name="labelTEXT">”Label.”</Option>
              <Option inputType="string" name="labelTEXT2">”Label.”</Option>
              <Option inputType="string" name="labelTEXT3">”Label.”</Option>
              <Option defaultValue="" indent="1" inputType="inputtext" missingMessage="Missing url." name="textURL" promptMessage="Enter a Project Gutenberg URL.">Input Project Gutenberg URL:</Option>
        </Options>
</Metadata>

 

 

Step 4: Creating the UI Portion of the Text Analysis Task

 

Each object that we just defined in the Metadata portion has corresponding code in the UI section. Just like we did in Step 3, find the code that corresponds to the combobox, text box, and checkboxes in the UI section of the Sample Task, and copy and paste it into the same place in your task. Then edit the code you copied to match what we want in our finished product.

 

This is what your finished UI portion should look like:

 

<UI>
       <Container option="DATATAB">
              <Group open="true" option="GROUPCOMBO">
                     <OptionItem option="labelCOMBO"/>
                     <OptionItem option="labelCOMBO2"/>
                     <OptionChoice option="bookCHOICE">
                         <OptionItem option="http://www.gutenberg.org/cache/epub/35688/pg35688.txt"/>
                         <OptionItem option="http://www.gutenberg.org/cache/epub/84/pg84.txt"/>
                         <OptionItem option="http://www.gutenberg.org/cache/epub/5200/pg5200.txt"/>
                         <OptionItem option="http://www.gutenberg.org/cache/epub/768/pg768.txt"/>
                     </OptionChoice>
                     <OptionItem option="labelTEXT"/>
                     <OptionItem option="labelTEXT2"/>
                     <OptionItem option="labelTEXT3"/>
                     <OptionItem option="textURL"/>
              </Group>
              <Group open="true" option="GROUPCHECK">
                     <OptionItem option="labelCHECK"/>
                     <OptionItem option="chk1"/>
                     <OptionItem option="chk2"/>
              </Group>
       </Container>
</UI>

 

Step 5: Creating the Code Template Portion of the Text Analysis Task

 

This is the portion of the task that contains your SAS Code. The approach I took to reading in the data was to use the URL option in a filename statement, and use several SAS macro variables for different pieces of the URL.

 

My full code is available for download at the bottom of this blog post and on GitHub, but the important thing to note here is how the SAS code works with the VTL code. Velocity Template Language has its own macro variables, and each of our UI elements has one. When you select an option in the UI, the value of the VTL macro variable will change immediately.

 

For this task, I chose to create SAS macro variables with the values of the VTL macro variables because I already had this SAS code before creating the task, but doing so is not necessary.

The SAS code below reads the text from the user-chosen Project Gutenberg URL and creates a dataset in which the text is split into words, each stored alongside an upper-case copy.

 

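ods noproctitle;
title;

/* Copy the Velocity (VTL) values into SAS macro variables. */
%let exampletext = "$bookCHOICE"; /* URL selected in the combobox */
%let textURL = "$textURL";        /* URL typed by the user (may be empty) */
%let chk1 = $chk1;                /* Value of checkbox 1 */
%let chk2 = $chk2;                /* Value of checkbox 2 */

/* Use the typed URL if there is one; otherwise use the example text. */
data _null_;
       textURL = &textURL;
       if textURL = "" then call symputx('text', &exampletext);
       else call symputx('text', &textURL);
run;

filename ibiblio url "&text" proxy = 'http://www.gutenberg.org/';

/* Read the text one blank-delimited word at a time. */
data TEXT_ORIGINAL;
       infile ibiblio dlm=' ';
       input Original : $32. @@;
       Word = compress(Original, ".,!?""[]"); /* strip common punctuation */
       Word_Upcase = upcase(Word);
       Order = _N_;                           /* position of the word in the text */
       Capital = anyupper(Word);              /* position of first uppercase letter, 0 if none */
       Punctuation = index(Original, '.');    /* position of first period, 0 if none */
run;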
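
As noted above, copying the VTL values into SAS macro variables is optional. The fragment below is a minimal sketch of the alternative: Velocity directives placed directly in the code template, so that the URL choice and the checkbox test are resolved before SAS receives the generated code. It assumes that an empty text box yields an empty (or undefined) `$textURL` and that a checked checkbox yields the value 1. Neither assumption comes from the original task, so check both against the Sample Task before relying on them.

```sas
/* Sketch only: an alternative to the %let statements and the DATA _NULL_ step. */
/* Velocity evaluates the directives first; SAS only sees the lines that remain. */
#if ($textURL && $textURL != "")
filename ibiblio url "$textURL" proxy = 'http://www.gutenberg.org/';
#else
filename ibiblio url "$bookCHOICE" proxy = 'http://www.gutenberg.org/';
#end

#if ($chk1 == 1)
/* SAS code for the first checkbox would go here (assumes checked = 1). */
#end
```

With this approach the generated program contains only the one FILENAME statement that applies, at the cost of mixing more template logic into the SAS code.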

 

The remaining SAS code contains two macros: one that is executed if the first checkbox is checked, and another that is executed if the second checkbox is checked. They are identical, except that the second removes a list of the 100 most common words from the dataset before doing the frequency analysis. You can download the full task to see the rest of the code, as it is too long to include in this post.
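
For readers who want a feel for that step without downloading the task, here is a rough sketch of how such a frequency analysis could be written. It is not the code from the downloadable task. The macro names, the output dataset names, and the COMMON_WORDS dataset (assumed to hold one upper-case word per row in a Word_Upcase column) are all hypothetical. It treats a word as "proper case" when its first character is an uppercase letter (Capital = 1), which is an assumption about the intended rule.

```sas
/* Sketch only: macro, dataset, and parameter names below are hypothetical. */
%macro word_freq(data=, out=, label=);
    /* Count each distinct word whose first character is an uppercase letter. */
    proc freq data=&data noprint;
        where Capital = 1;
        tables Word / out=&out(drop=Percent);
    run;

    /* Put the most frequent words first. */
    proc sort data=&out;
        by descending Count;
    run;

    title "&label";
    proc print data=&out(obs=25) noobs;
    run;
    title;
%mend word_freq;

%macro run_requested;
    %if &chk1 = 1 %then %do;
        %word_freq(data=TEXT_ORIGINAL, out=FREQ_ALL,
                   label=Proper Case Words by Frequency)
    %end;
    %if &chk2 = 1 %then %do;
        /* COMMON_WORDS is an assumed dataset of words to exclude. */
        proc sql;
            create table TEXT_FILTERED as
            select * from TEXT_ORIGINAL
            where Word_Upcase not in (select Word_Upcase from COMMON_WORDS);
        quit;
        %word_freq(data=TEXT_FILTERED, out=FREQ_FILTERED,
                   label=Proper Case Words Excluding Common Words)
    %end;
%mend run_requested;

%run_requested
```

The sketch relies on the TEXT_ORIGINAL dataset and the chk1 and chk2 macro variables created by the code shown earlier. Wrapping the %if logic in a macro keeps it compatible with SAS releases that do not allow %if in open code.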

 

Step 6: Run the Text Analysis Task

 

You’re finished! You just created a custom user interface to download text, turn it into a dataset of words, and analyze the frequencies. Click the Save button to save, then click the Run button to open the task. Make your selections, then click Run again to watch it run!

 

Want to try it yourself?

Get the code from the zip file at the end of this article or from GitHub.


 
