I am trying to use the Text Parsing node to keep only the useful words in my dataset, based on a start list that I specified. However, the dataset exported by the node does not change at all. The results show that only staff, restaurant, swimming pool and spa have "Y" in the keep column and all other words have "N", yet the exported dataset is identical to the original.
Original dataset:
review_ID review
0001 I am satisfied with the staff and also the restaurant
0002 The swimming pool and spa are amazing
...
Start list:
staff
restaurant
swimming pool
spa
Expected exported dataset from text parsing:
review_ID review
0001 staff, restaurant
0002 swimming pool, spa
...
Actual exported dataset from text parsing: (not what I want)
review_ID review
0001 I am satisfied with the staff and also the restaurant
0002 The swimming pool and spa are amazing
...
The Text Parse node creates an underlying representation in the Terms table (which you mentioned you saw) and a term-by-document frequency table that we refer to as the parent table. You cannot see the parent table directly unless you look in your workspace project directory.
When you follow the Text Parse node with a Text Filter node and other Text Mining nodes, those nodes use these representations and not the original input text in the exported table. So your kept and dropped terms are being honored. It is not until you use a Text Cluster node or a Text Topics node that you see a change in the exported table. Even then, the change is in a set of columns that hold the numeric representation of each document (taking your dropped terms into account). The raw input text itself is never changed in the export.
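To make the idea concrete, here is a toy illustration in Python. It is not SAS code and not how the Text Parse node is implemented; it only sketches, under my own assumptions, how a start list can yield (termnum, document, frequency) triples while the raw text stays untouched. The term numbers and the helper `count_kept_terms` are hypothetical.

```python
import re

# Toy data copied from the question; everything else here is hypothetical.
reviews = {
    "0001": "I am satisfied with the staff and also the restaurant",
    "0002": "The swimming pool and spa are amazing",
}
start_list = ["staff", "restaurant", "swimming pool", "spa"]

# Assign each start-list term an arbitrary term number (1, 2, 3, ...).
termnum = {term: i for i, term in enumerate(start_list, start=1)}

def count_kept_terms(text):
    """Count whole-word occurrences of each start-list term in one document."""
    counts = {}
    for term in start_list:
        pattern = r"\b" + re.escape(term) + r"\b"
        n = len(re.findall(pattern, text.lower()))
        if n:
            counts[termnum[term]] = n
    return counts

# Triples of (termnum, document, frequency), in the spirit of the parent table.
triples = [
    (num, doc_id, freq)
    for doc_id, text in reviews.items()
    for num, freq in count_kept_terms(text).items()
]

# The raw review text is only read, never modified.
for row in triples:
    print(row)
```

The point of the sketch is that the filtering lives in the derived triples, so looking at the original text column will never show the effect of the start list.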
Russ
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
Thanks, Russ!
However, if I want to perform link analysis on the review factors that I specified in the start list (e.g. swimming pool, spa), how should I do it?
Should I edit the dataset to be like this?
review_ID review
0001 I
0001 am
0001 satisfied
0001 with
.... .....
0002 The
0002 swimming
0002 pool
0002 and
0002 spa
.... .....
and then let the Text Parsing node do its job of ignoring the words that are not in the start list?
Jonathon,
You can use the parent table in the workspace directory. It stores triples of the form
termnum document frequency
To interpret the results, you just have to map each termnum back to its term string using the Terms table.
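As a rough sketch of that mapping step, the Python below joins parent-table triples to a terms table and keeps only terms flagged "Y". The table contents, the term numbers, and the function `kept_terms_by_document` are made up for illustration; the real tables in your workspace may use different column names, and the document column may be an internal index rather than review_ID.

```python
from collections import defaultdict

# Hypothetical terms table: termnum -> (term string, keep flag).
terms = {
    1: ("staff", "Y"),
    2: ("restaurant", "Y"),
    3: ("swimming pool", "Y"),
    4: ("spa", "Y"),
    5: ("amazing", "N"),
}

# Hypothetical parent table rows: (termnum, document, frequency).
parent = [
    (1, "0001", 1),
    (2, "0001", 1),
    (3, "0002", 1),
    (4, "0002", 1),
    (5, "0002", 1),
]

def kept_terms_by_document(parent_rows, terms_table):
    """Map each termnum back to its term string, keeping only keep == 'Y'."""
    by_doc = defaultdict(list)
    for num, doc, _freq in parent_rows:
        term, keep = terms_table[num]
        if keep == "Y":
            by_doc[doc].append(term)
    return dict(by_doc)

result = kept_terms_by_document(parent, terms)
for doc, kept in sorted(result.items()):
    print(doc, ", ".join(kept))
# prints:
# 0001 staff, restaurant
# 0002 swimming pool, spa
```

A table in this shape (one row per document with only the kept terms) is the layout asked for earlier in the thread, without having to split the raw reviews into one word per row by hand.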
Russ