Jonathanzz
Obsidian | Level 7

I am trying to use the Text Parsing node to keep only the useful words in the dataset, based on a start list I specified. However, the dataset exported by Text Parsing does not change at all. The results show that only staff, restaurant, swimming pool and spa have "Y" in the Keep column, while all other words have "N", yet the exported dataset is identical to the input.

 

Original dataset:

review_ID    review

0001           I am satisfied with the staff and also the restaurant

0002           The swimming pool and spa are amazing

...

 

Start list:

staff

restaurant

swimming pool

spa

 

Expected exported dataset from text parsing:

review_ID    review

0001           staff, restaurant

0002           swimming pool, spa

...

 

Actual exported dataset from text parsing: (not what I want)

review_ID    review

0001           I am satisfied with the staff and also the restaurant

0002           The swimming pool and spa are amazing

...

 

 

3 REPLIES
RussAlbright
SAS Employee

The Text Parse node creates an underlying representation in the Terms table (which you mentioned you saw) and a term-by-document frequency table that we refer to as the parent table. You cannot see the parent table directly unless you look in your workspace project directory.

When you follow the Text Parse node with a Text Filter node and other Text Mining nodes, these representations are used, not the original input text in the exported table. So the stopped terms are being taken into account. It is not until you use a Text Cluster node or a Text Topics node that you see a change in the exported table. And even then, the change is in a set of columns that are the numeric representation of the document (taking into account your stopped terms). The actual raw input text is never changed in the export.
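To make the two representations concrete, here is a toy Python sketch. It is not how Text Miner parses text (the real node also handles stemming, parts of speech, and so on), and the variable names `terms` and `parent` are only illustrative. It simply shows the idea: the start list decides which terms are counted, the counts go into (termnum, document, frequency) triples, and the raw review text is left untouched.

```python
import re

# Toy documents and start list taken from the question above.
documents = {
    "0001": "I am satisfied with the staff and also the restaurant",
    "0002": "The swimming pool and spa are amazing",
}
start_list = ["staff", "restaurant", "swimming pool", "spa"]

# Hypothetical terms table: termnum -> term string (kept terms only).
terms = {num: term for num, term in enumerate(start_list, start=1)}

# Hypothetical parent table: (termnum, document, frequency) triples.
parent = []
for doc_id, text in documents.items():
    for num, term in terms.items():
        # Count whole-word (or whole-phrase) matches, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        freq = len(re.findall(pattern, text.lower()))
        if freq > 0:
            parent.append((num, doc_id, freq))

for row in parent:
    print(row)
```

Note that `documents` is never modified, which mirrors why the exported review column looks the same before and after parsing.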

 

Russ


Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

Jonathanzz
Obsidian | Level 7

Thanks, Russ!

 

However, what if I want to perform link analysis on the review factors I specified in the start list (e.g. swimming pool, spa, etc.)? How should I do it?

 

Should I edit the dataset to be like this?

 

review_ID    review

0001           I

0001           am

0001           satisfied

0001           with

....               .....

0002           The

0002           swimming

0002           pool

0002           and

0002           spa

....               .....

 

and then let the Text Parsing node do its job of ignoring the words that are not in the start list?

RussAlbright
SAS Employee

Jonathon,

 

You can use the parent table in the workspace directory. It has the form of triples:

termnum    document    frequency

 

In the end, in order to interpret results, you just have to map the termnum back to the term string from the terms table.
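As a rough illustration of that mapping step, here is a small Python sketch. It assumes the two tables have already been read out of the workspace into plain Python structures. The names `terms`, `parent`, `doc_terms` and `links` are hypothetical, and the rows are made up to match the example reviews in the question. It maps each termnum back to its term string per document, then counts which kept terms appear together, which is one simple starting point for link analysis.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical terms table rows: termnum -> term string.
terms = {1: "staff", 2: "restaurant", 3: "swimming pool", 4: "spa"}

# Hypothetical parent table rows as (termnum, document, frequency) triples.
parent = [(1, "0001", 1), (2, "0001", 1), (3, "0002", 1), (4, "0002", 1)]

# Map each termnum back to its term string, grouped by document.
doc_terms = defaultdict(list)
for termnum, document, frequency in parent:
    doc_terms[document].append(terms[termnum])

# Count pairs of terms that co-occur in the same document (candidate links).
links = defaultdict(int)
for term_list in doc_terms.values():
    for a, b in combinations(sorted(set(term_list)), 2):
        links[(a, b)] += 1

for document, term_list in sorted(doc_terms.items()):
    print(document, ", ".join(term_list))
```

With these toy rows, each review ends up with just its start-list terms, and `links` holds one count per co-occurring pair.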

 

Russ

