Solved: SAS Visual Text Analytics on SAS VFL

azaras · Posted 11-06-2023 09:49 AM

Hi, I want to extract into a data set the terms kept by the Text Parsing node of SAS Visual Text Analytics. If I click Open on the node, I can see the kept terms and the dropped terms, but I haven't found a way to extract them into a data set.

Can you do this in SAS VFL?

Thanks in advance,

Andreas

PeterChristie · Posted 11-06-2023 02:57 PM

Hello @azaras! I just verified that I can create a data set of terms from a document collection on VFL.

In Visual Text Analytics, the user interface does not expose every option. To do things like create a data set of terms, you have to go behind the scenes and call the CAS actions directly. Use the tpParse and tpAccumulate actions to create the table you want.

The following program shows how to create a data set of terms. You will have to update the cashost name for your environment. There is more information here:

SAS Help Center: Accumulate Term Information Using the tpAccumulate Action

/* Connect to the CAS server; replace cashost with your own host name. */
options casport=5570 cashost="cloud.example.com";
cas casauto;
libname mycas cas;

/* Load a small collection of reviews into CAS; did is the document ID. */
data mycas.reviews;
   infile datalines delimiter='|' missover;
   length text $300 category $20;
   input text$ positive category$ did;
   datalines;
      This is the greatest phone ever! love it!|1|electronics|1
      The phone's battery life is too short and screen resolution is low.|0|electronics|2
      The screen resolution is low, but I love this tv.|1|electronics|3
      The movie itself is great and I like it, although the resolution is low.|1|movies|4
      The movie's story is boring and the acting is poor.|0|movies|5
      I watched this movie on tv, it's not good on a small screen. |0|movies|6
      watched the movie first and loved it, the book is even better!|1|books |7
      I like the story in this book, they should put it on screen.|1|books|8
      I love the author, but this book is a waste of time, don't buy it.|0|books|9
   ;
run;

proc cas;

   /* Parse the text column and write term positions to the pos table. */
   textParse.tpParse /
      docId="did"
      offset={name="pos"}
      table={name="reviews"}
      text="text";
   run;

   /* Accumulate the positions into the terms, parent, and child tables. */
   textParse.tpAccumulate /
      child={name="child"}
      offset={name="pos"}
      parent={name="parent"}
      reduce=1
      terms={name="terms"};
   run;

   /* Display the terms table, sorted by term and role. */
   table.fetch /
      table={name="terms"}
      sortby={"_term_", "_role_"};
   run;

   /* Display the parent table, sorted by document and term number. */
   table.fetch /
      table={name="parent"}
      sortby={"_document_", "_termNum_"};
   run;

   /* Display the child table, sorted by document and term number. */
   table.fetch /
      table={name="child"}
      sortby={"_Document_", "_termnum_"};
   run;

quit;
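
The tables created above live in CAS memory. To get the terms into an ordinary SAS data set, as the question asks, one option is a DATA step that reads through the mycas libref. This is a minimal sketch that assumes the CAS session and libref from the program above are still active and that the terms table sits in the active caslib; work.terms_ds is only an illustrative name.

```sas
/* Copy the in-memory CAS table to a data set in the WORK library. */
data work.terms_ds;
   set mycas.terms;
run;

/* Print the first rows to confirm the copy. */
proc print data=work.terms_ds(obs=20);
run;
```

The same pattern should work for the parent, child, and pos tables if you need those as data sets too.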

Hope this helps get you moving in the right direction!

azaras · Posted 11-07-2023 05:07 AM

Thanks!

I managed to do it with this code. But what is the "pos" in the code?
BR,
Andreas

azaras · Posted 11-07-2023 05:21 AM

Hi again!

In the pos table in my example I see around 63,000 terms, but in the terms table I see around 11,000. The 11,000 are the kept terms as shown in SAS Visual Text Analytics, so is the difference the dropped terms? Can I get the drop and keep lists that SAS uses by default?
Thanks

PeterChristie · Posted 11-07-2023 09:26 AM

Hello @azaras, “pos” in the code is just the name of the table that the “offset” parameter creates. It holds the position of child terms in each document of the document collection, recording the start and end position for a term in a document.

The count of child terms in your pos table is higher than the count in the terms table because a term can appear in multiple documents. The terms table lists each individual term once and shows its frequency in the document collection. The difference in count is not due to dropped terms. The pos table has a “_Document_” column that you can check if you want to verify what is occurring.
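
One way to check the explanation above is to compare the number of rows in pos with the number of distinct terms it contains. This is a rough sketch: it assumes the pos table has a _Term_ column alongside _Document_, which is an assumption, so the PROC CONTENTS step is there to confirm the column names first.

```sas
/* List the columns in pos; the _Term_ column name used below is an assumption. */
proc contents data=mycas.pos;
run;

/* Compare total rows with distinct terms and distinct documents. */
proc sql;
   select count(*)                   as pos_rows,
          count(distinct _Term_)     as distinct_terms,
          count(distinct _Document_) as documents
   from mycas.pos;
quit;
```

The distinct count will not necessarily match the terms table row for row, since that table also distinguishes terms by role, but it should be far closer to it than the raw row count.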

As for the stop list: the Environment Manager > Data > ReferenceData folder has the default stop lists. The en_stoplist file, for example, contains the default stop words for English. There are over 1200 words in that list. Customers often create their own stop word lists for domain-specific document collections.
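
If you want tpAccumulate to apply that stop list, a possible sketch follows. It rests on several assumptions that you should check against your environment and the action documentation: that the file is stored as en_stoplist.sashdat, that the caslib is named ReferenceData, and that tpAccumulate in your release accepts a stopList table parameter.

```sas
proc cas;

   /* Load the default English stop list into memory.
      The caslib name and file name are assumptions. */
   table.loadTable /
      caslib="ReferenceData"
      path="en_stoplist.sashdat"
      casOut={name="en_stoplist", replace=true};
   run;

   /* Re-run the accumulation with the stop list applied,
      replacing the tables created earlier. */
   textParse.tpAccumulate /
      offset={name="pos"}
      stopList={name="en_stoplist"}
      parent={name="parent", replace=true}
      child={name="child", replace=true}
      terms={name="terms", replace=true}
      reduce=1;
   run;

quit;
```

Comparing the terms table before and after this run is one way to see which terms the stop list affects.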
