
Joining the text topic node output with the input data

Occasional Contributor
Posts: 5


Hi All,

I am looking for a way to join the Text Topic node output with the raw data so that I can see which topic each document falls under. Here is an example. My input data looks like:

id    text                        product
1     Amber blink continuously    x142
2     Memory space problem        x189
3     video blurr                 x902

The Text Topic node gave me this output:

topic id    topic
1           memory
2           amber
3           video

I want to create a data set combining the raw data and the Text Topic node results, like this:

id    text                        product    topic
1     Amber blink continuously    x142       amber
2     Memory space problem        x189       memory
3     video blurr                 x902       video

I checked the intermediate data sets that the Text Topic node creates, but none of them contains both the raw data and the topic results.

Can you please help me with this?

Thanks in Advance

Srini

Occasional Contributor
Posts: 5

Re: Joining the text topic node output with the input data

Help please

Occasional Contributor
Posts: 5

Re: Joining the text topic node output with the input data

I tried using PROC TGPARSE to extract certain keywords from the text and was able to join them back to the raw data. But I am not able to control the text mining. For example, I cannot use a stop list and a start list together. Is there also an option for passing synonyms?

SAS Employee
Posts: 5

Re: Joining the text topic node output with the input data

Hi,

The data set exported from the Text Topic node includes the original (raw) data plus new variables associated with the topics. Simplifying slightly, it contains additional binary indicators (whether the record belongs to the topic or not) as well as raw scores (something like a projection onto that topic; the values in the example below are mocked):

id    text                        product    _1_0_amber    _1_0_memory    _1_0_video    amber    memory    video
1     Amber blink continuously    x142       1             0              0             .9       .05       .08
2     Memory space problem        x189       0             1              0             .1       2.3       .07
3     video blurr                 x902       0             0              1             .1       .1        1.1

As you can see, this includes the original (raw) dataset of 3 records, then has 3 additional binary columns (corresponding to the 3 topics) and 3 additional raw topic score columns (also corresponding to the 3 topics).
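
If a single topic label per record is wanted, as in the original question, one option is to pick the topic with the highest raw score on each row. The sketch below is in plain Python rather than SAS, purely to illustrate the logic. The rows reuse the mocked values from the example above, and the column list and the add_top_topic helper are hypothetical names, not part of SAS Text Miner. Note that a real document can belong to several topics or to none, so the binary indicators may be the better source in practice.

```python
# Mock rows mirroring the example table above; the scores are illustrative only.
rows = [
    {"id": 1, "text": "Amber blink continuously", "product": "x142",
     "amber": 0.9, "memory": 0.05, "video": 0.08},
    {"id": 2, "text": "Memory space problem", "product": "x189",
     "amber": 0.1, "memory": 2.3, "video": 0.07},
    {"id": 3, "text": "video blurr", "product": "x902",
     "amber": 0.1, "memory": 0.1, "video": 1.1},
]

# Assumed names of the raw topic score columns in the exported data.
TOPIC_COLUMNS = ["amber", "memory", "video"]


def add_top_topic(row, topic_columns=TOPIC_COLUMNS):
    """Return a copy of the row with a 'topic' key naming its highest-scoring topic."""
    labelled = dict(row)
    labelled["topic"] = max(topic_columns, key=lambda name: row[name])
    return labelled


labelled_rows = [add_top_topic(r) for r in rows]

for r in labelled_rows:
    print(r["id"], r["text"], r["product"], r["topic"])
```

The same idea carries over to a DATA step that loops over the score variables and keeps the name of the largest one.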

At the moment, I am not sure why you would use both a START list (which says 'only use these terms') and a STOP list (which says 'do not use these terms') at the same time.  It seems that I would simply remove words from the START list that also occurred in the STOP list, and then use this new START list.  Would that help what you are trying to accomplish?
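
The start list clean-up suggested above amounts to a set difference. Here is a minimal sketch in plain Python, assuming both lists are simple collections of terms and that matching should ignore case. The function name and the sample terms are made up for illustration, and the result would still need to be saved in whatever format the start list data set requires.

```python
def filter_start_list(start_terms, stop_terms):
    """Keep start-list terms that are not in the stop list (case-insensitive),
    preserving the original order and dropping duplicates."""
    stop = {term.strip().lower() for term in stop_terms}
    seen = set()
    kept = []
    for term in start_terms:
        key = term.strip().lower()
        if key and key not in stop and key not in seen:
            seen.add(key)
            kept.append(term.strip())
    return kept


# Hypothetical term lists, for illustration only.
start_terms = ["amber", "memory", "video", "problem", "blink"]
stop_terms = ["problem", "Blink"]

new_start_list = filter_start_list(start_terms, stop_terms)
print(new_start_list)
```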

As for importing synonyms, that is available in the Text Filter node.

Hope this helps!
