chekurti
Calcite | Level 5

Hi All,

I am looking for a way to join the Text Topic node output with the raw data so that I can see which topic each document falls under. Here is an example. My input data looks like this:

id    text                        product
1     Amber blink continuously    x142
2     Memory space problem        x189
3     video blurr                 x902

The Text Topic node gave me this output:

topic id    topic
1           memory
2           amber
3           video

I want to create a data set combining the raw data and the Text Topic node results, like this:

id    text                        product    topic
1     Amber blink continuously    x142       amber
2     Memory space problem        x189       memory
3     video blurr                 x902       video

I checked the intermediate data sets that the Text Topic node produces, but none of them contains the raw data together with the topic results.

Can you guys please help me with this?

Thanks in advance,

Srini

3 REPLIES
chekurti
Calcite | Level 5

I tried using PROC TGPARSE to extract certain keywords from the text and was able to join them back to the raw data. However, I am not able to control the text mining. For example, I cannot use a stop list and a start list together. Is there any option for passing in synonyms?

JustinPlumley
SAS Employee

Hi,

The dataset exported from the Text Topic Node includes the original (raw) data plus new variables associated with the topics.  Simplifying slightly, the exported dataset contains additional binary indicators (whether the record belongs to the topic or not) as well as raw scores (roughly a projection of the record onto that topic). The values in the example below are mocked:

id  text                      product  _1_0_amber  _1_0_memory  _1_0_video  amber  memory  video
1   Amber blink continuously  x142     1           0            0           .9     .05     .08
2   Memory space problem      x189     0           1            0           .1     2.3     .07
3   video blurr               x902     0           0            1           .1     .1      1.1

As you can see, this includes the original (raw) dataset of 3 records, then has 3 additional binary columns (corresponding to the 3 topics) and 3 additional raw topic score columns (also corresponding to the 3 topics).
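To get the single "topic" column asked for in the original post, the binary indicators can be collapsed into one label per record. Here is a minimal sketch of that logic, written in plain Python rather than SAS and using only the mocked values from the example table. The helper name assign_topic is hypothetical, and the tie-breaking rule (highest raw score among the flagged topics, no label when nothing is flagged) is an assumption of mine, not something the node does for you.

```python
# Mocked rows copied from the example table (not real Text Topic node output).
rows = [
    {"id": 1, "text": "Amber blink continuously", "product": "x142",
     "_1_0_amber": 1, "_1_0_memory": 0, "_1_0_video": 0,
     "amber": 0.9, "memory": 0.05, "video": 0.08},
    {"id": 2, "text": "Memory space problem", "product": "x189",
     "_1_0_amber": 0, "_1_0_memory": 1, "_1_0_video": 0,
     "amber": 0.1, "memory": 2.3, "video": 0.07},
    {"id": 3, "text": "video blurr", "product": "x902",
     "_1_0_amber": 0, "_1_0_memory": 0, "_1_0_video": 1,
     "amber": 0.1, "memory": 0.1, "video": 1.1},
]

TOPICS = ["amber", "memory", "video"]
FLAG_PREFIX = "_1_0_"  # prefix of the binary indicator columns in the example


def assign_topic(row, topics=TOPICS):
    """Return one topic label for a record (hypothetical helper).

    Assumption: if several indicators are 1, keep the topic with the
    highest raw score; if none is 1, return None.
    """
    flagged = [t for t in topics if row[FLAG_PREFIX + t] == 1]
    if not flagged:
        return None
    return max(flagged, key=lambda t: row[t])


# Keep the raw columns and add the derived topic label.
labelled = [
    {"id": r["id"], "text": r["text"], "product": r["product"],
     "topic": assign_topic(r)}
    for r in rows
]

for r in labelled:
    print(r["id"], r["text"], r["product"], r["topic"], sep=" | ")
```

The same idea should carry over to a DATA step that loops over the indicator variables, but the variable names in a real export will differ from the mocked ones here.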

At the moment, I am not sure why you would use both a START list (which says 'only use these terms') and a STOP list (which says 'do not use these terms') at the same time.  It seems that I would simply remove words from the START list that also occurred in the STOP list, and then use this new START list.  Would that help what you are trying to accomplish?
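Removing the STOP terms from the START list is just a set difference. A minimal sketch in plain Python (not SAS), with made-up term lists; the case-insensitive comparison is an assumption, and the result would still need to be loaded back into whatever start-list table your node expects.

```python
# Hypothetical term lists, for illustration only.
start_terms = ["amber", "memory", "video", "problem"]
stop_terms = ["Problem", "the", "a"]

# Compare in lower case so "Problem" and "problem" match (an assumption).
stop_set = {term.lower() for term in stop_terms}

# Keep START terms that are not in the STOP list, preserving their order.
filtered_start = [t for t in start_terms if t.lower() not in stop_set]

print(filtered_start)
```

The filtered list is then used on its own as the START list, with no STOP list needed.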

As far as importing synonyms, that is available at the Text Filter node.

Hope this helps!
