Hi, I want to extract the terms kept by the Text Parsing node of SAS Visual Text Analytics into a data set. If I click Open on the relevant node, I can see the kept terms and the dropped terms, but I haven't found a way to extract them into a data set.
Can you do this in SAS VFL?
Thanks in advance,
Andreas
Hello @azaras! I just verified that I could create a data set of terms from a document collection on VFL.
In Visual Text Analytics, the user interface does not expose every option. You have to go behind the scenes and use CAS actions to do things like create a data set of terms. Use the tpParse and tpAccumulate actions to create the table you want.
The following program shows how to create a data set of terms. You will have to update the cashost value for your environment. There is more information here:
SAS Help Center: Accumulate Term Information Using the tpAccumulate Action
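/* Connect to CAS; update cashost (and casport if needed) for your environment */
options casport=5570 cashost="cloud.example.com";
cas casauto;
libname mycas cas;
/* Load a small sample document collection into CAS */
data mycas.reviews;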
infile datalines delimiter='|' missover;
length text $300 category $20;
input text$ positive category$ did;
datalines;
This is the greatest phone ever! love it!|1|electronics|1
The phone's battery life is too short and screen resolution is low.|0|electronics|2
The screen resolution is low, but I love this tv.|1|electronics|3
The movie itself is great and I like it, although the resolution is low.|1|movies|4
The movie's story is boring and the acting is poor.|0|movies|5
I watched this movie on tv, it's not good on a small screen. |0|movies|6
watched the movie first and loved it, the book is even better!|1|books |7
I like the story in this book, they should put it on screen.|1|books|8
I love the author, but this book is a waste of time, don't buy it.|0|books|9
;
run;
proc cas;
   /* Parse the documents; term positions are written to the "pos" table */
   textParse.tpParse /
      docId="did"
      offset={name="pos"}
      table={name="reviews"}
      text="text";
   run;
   /* Accumulate the parsed terms into the terms, parent, and child tables */
   textParse.tpAccumulate /
      child={name="child"}
      offset={name="pos"}
      parent={name="parent"}
      reduce=1
      terms={name="terms"};
   run;
   /* Display the resulting tables */
   table.fetch /
      table={name="terms"}
      sortby={"_term_", "_role_"};
   run;
   table.fetch /
      table={name="parent"}
      sortby={"_document_", "_termNum_"};
   run;
   table.fetch /
      table={name="child"}
      sortby={"_Document_", "_termnum_"};
   run;
quit;
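To get from the in-memory terms table to an ordinary SAS data set that holds only the kept terms, something along these lines may work. This is a minimal sketch, not a verified recipe: it assumes the CAS session and the mycas libref from the program above are still active, and it assumes the terms table has a _Keep_ column that flags kept terms with 'Y'. The output name work.kept_terms is just an illustrative choice. Check the actual column names first (for example with PROC CONTENTS on mycas.terms).

```sas
/* Inspect the columns of the terms table before filtering */
proc contents data=mycas.terms;
run;

/* Copy the kept terms from CAS into a regular SAS data set.        */
/* Assumption: _Keep_ is a character column with 'Y' for kept terms. */
data work.kept_terms;
   set mycas.terms;
   where _keep_ = 'Y';
run;
```

Changing the WHERE condition to 'N' would, under the same assumption, give the dropped terms instead.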
Hope this helps get you moving in the right direction!
Hello @azaras, “pos” in the code is just the name of the table that the “offset” parameter creates. It holds the position of the child terms in each document of the document collection, recording the start and end position of a term in a document.
The pos table has more rows than the terms table because a term can appear in multiple documents. The terms table lists each individual term once and shows its frequency across the document collection. The difference in count is not due to dropped terms. The pos table has a “_Document_” column that you can check if you want to verify what is occurring.
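One way to check this is to compare the row counts of the two tables and then count the pos rows per document. This is only a sketch and assumes the pos and terms tables from the earlier program still exist in the active caslib of your session.

```sas
proc cas;
   /* Row counts: pos has one row per recorded term position, */
   /* while terms lists each term once                         */
   table.recordCount / table={name="pos"};
   table.recordCount / table={name="terms"};
   run;
   /* Number of pos rows contributed by each document */
   simple.freq /
      table={name="pos"}
      inputs={"_document_"};
   run;
quit;
```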
As for the stop list: the default stop lists are in the Environment Manager > Data > ReferenceData folder. The en_stoplist file, for example, contains the default stop words for English. There are over 1200 words in that list. Customers often create their own stop word lists for domain-specific document collections.
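If you want the accumulated terms to reflect a stop list, tpAccumulate accepts a stopList table. The following is a hedged sketch rather than a tested program: it assumes the en_stoplist table is available in a caslib named ReferenceData on your server (you may need to load it first), and it uses replace=true so the output tables from the earlier run are overwritten.

```sas
proc cas;
   textParse.tpAccumulate /
      offset={name="pos"}
      /* Assumption: default English stop list in the ReferenceData caslib */
      stopList={name="en_stoplist", caslib="ReferenceData"}
      reduce=1
      parent={name="parent", replace=true}
      child={name="child", replace=true}
      terms={name="terms", replace=true};
   run;
quit;
```

To use a custom stop list, point stopList at your own CAS table instead.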