azaras
Obsidian | Level 7

Hi, I want to extract into a data set the terms kept by the Text Parsing node of SAS Visual Text Analytics. If I click Open on the node, I can see the kept terms and the dropped terms, but I haven't found a way to extract them into a data set.

Can you do this in SAS VFL?

Thanks in advance,

Andreas

Accepted Solutions
PeterChristie
SAS Employee

Hello @azaras! I just verified that I could create a data set of terms from a document collection on VFL.

In Visual Text Analytics, the user interface does not expose every option. To create a data set of terms, you have to go behind the scenes and call the CAS actions directly. Use the tpParse and tpAccumulate actions to create the table you want.

The following program shows how to create a data set of terms. You will have to update the cashost name. There is more information here:

SAS Help Center: Accumulate Term Information Using the tpAccumulate Action

/* Connect to CAS; change cashost (and casport if needed) to your own server */
options casport=5570 cashost="cloud.example.com";
cas casauto;
libname mycas cas; 

/* Create a small CAS table of sample reviews, one document per row */
data mycas.reviews;
   infile datalines delimiter='|' missover;
   length text $300 category $20;
   input text$ positive category$ did;
   datalines;
      This is the greatest phone ever! love it!|1|electronics|1
      The phone's battery life is too short and screen resolution is low.|0|electronics|2
      The screen resolution is low, but I love this tv.|1|electronics|3
      The movie itself is great and I like it, although the resolution is low.|1|movies|4
      The movie's story is boring and the acting is poor.|0|movies|5
      I watched this movie on tv, it's not good on a small screen. |0|movies|6
      watched the movie first and loved it, the book is even better!|1|books |7
      I like the story in this book, they should put it on screen.|1|books|8
      I love the author, but this book is a waste of time, don't buy it.|0|books|9
   ;
run;

proc cas;

   /* Parse the text column; term positions are written to the "pos" table */
   textParse.tpParse /
      docId="did"
      offset={name="pos"}
      table={name="reviews"}
      text="text";
   run;

   /* Accumulate the parsed positions into the terms, parent, and child tables */
   textParse.tpAccumulate /
      child={name="child"}
      offset={name="pos"}
      parent={name="parent"}
      reduce=1
      terms={name="terms"};
   run;

   /* Print the terms table */
   table.fetch /
      table={name="terms"}
      sortby={"_term_", "_role_"};
   run;

   /* Print the parent table */
   table.fetch /
      table={name="parent"}
      sortby={"_document_", "_termNum_"};
   run;

   /* Print the child table */
   table.fetch /
      table={name="child"}
      sortby={"_Document_", "_termnum_"};
   run;

quit;
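
The program above only prints the tables with table.fetch. To get an actual data set, one option is to copy the CAS table through the mycas libref. This is a minimal sketch, not part of the original program: it assumes the program above ran in the same session, and it assumes the terms table carries a _Keep_ column with the values Y and N. The names work.all_terms, work.kept_terms, and work.dropped_terms are made up for illustration.

```sas
/* Sketch: copy the session-scope CAS table "terms" into a SAS data set.
   Assumes the program above has already run in this SAS session. */
data work.all_terms;
   set mycas.terms;
run;

/* Assumption: the terms table has a character column _Keep_ holding 'Y' or 'N'.
   Check the columns of work.all_terms before relying on this split. */
data work.kept_terms work.dropped_terms;
   set work.all_terms;
   if upcase(_keep_) = 'Y' then output work.kept_terms;
   else output work.dropped_terms;
run;
```

Copying to the WORK library pulls the rows from CAS to the SAS client, which is fine for a term list of this size.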

 

 

Hope this helps get you moving in the right direction!

 

azaras
Obsidian | Level 7
Thanks!

I managed to do it with this code. But what is "pos" in the code?
BR,
Andreas
azaras
Obsidian | Level 7
Hi again!

In the pos table in my example I see around 63,000 terms, but in the terms table I see around 11,000. The 11,000 are the kept terms as shown in SAS Visual Text Analytics, so are the dropped terms the difference? Can I get the drop and keep lists that SAS uses by default?
Thanks
PeterChristie
SAS Employee

Hello @azaras, “pos” in the code is just the name of the table that the “offset” parameter creates. It holds the position of child terms in each document of the document collection, recording the start and end position of a term in a document.

 

Your pos table has more rows than the terms table because a term can appear in multiple documents. The terms table lists each individual term once and shows its frequency in the document collection. The difference in count is not due to dropped terms. The pos table has a “_Document_” column that you can check if you want to verify what is occurring.
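
One way to check this is to compare row counts directly. The following is a rough sketch that assumes the pos and terms tables from the earlier program still exist in the session and are reachable through the mycas libref; only the _Document_ column mentioned above is used.

```sas
proc sql;
   /* Total rows in each table: pos should be much larger than terms */
   select count(*) as pos_rows   from mycas.pos;
   select count(*) as terms_rows from mycas.terms;

   /* Rows in pos per document, to see how the positions are spread out */
   select _Document_, count(*) as n_rows
      from mycas.pos
      group by _Document_
      order by _Document_;
quit;
```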

 

As for the stop list: the Environment Manager > Data > ReferenceData folder has the default stop lists. The en_stoplist file, for example, contains the default stop words for English. There are over 1200 words in that list. Customers often create their own stop word lists for domain-specific document collections.
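
To look at the default list and apply it outside the user interface, something like the following might work. It is a sketch under several assumptions: that the stop list is stored as en_stoplist.sashdat in a caslib named ReferenceData, that your session can read that caslib, and that the pos table from the earlier program still exists. The output table names ending in _stop are made up for illustration.

```sas
proc cas;
   /* Assumption: caslib name and file name; adjust to match your deployment */
   table.loadTable /
      caslib="ReferenceData"
      path="en_stoplist.sashdat"
      casOut={name="en_stoplist", replace=true};
   run;

   /* Print the first 20 stop words */
   table.fetch /
      table={name="en_stoplist"}
      to=20;
   run;

   /* Accumulate again, this time passing the stop list to tpAccumulate */
   textParse.tpAccumulate /
      offset={name="pos"}
      stopList={name="en_stoplist"}
      reduce=1
      terms={name="terms_stop", replace=true}
      parent={name="parent_stop", replace=true}
      child={name="child_stop", replace=true};
   run;
quit;
```

Comparing terms_stop with the original terms table should show which terms the stop list affects.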
