Re: Joining the text topic node output with the input data

chekurti · Posted 09-11-2013 10:01 PM

Hi All,

I am looking for a way to join the Text Topic node output with the raw data so that I can see which topic each document falls under. Here is an example. My input data looks like:

id  text                      product
1   Amber blink continuously  x142
2   Memory space problem      x189
3   video blurr               x902

The Text Topic node gave me this output:

topic id  topic
1         memory
2         amber
3         video

I want to create a data set combining the raw data and the Text Topic node results, like this:

id  text                      product  topic
1   Amber blink continuously  x142     amber
2   Memory space problem      x189     memory
3   video blurr               x902     video

I checked the intermediate datasets that the Text Topic node produces, but none of them contains the raw data together with the topic results.

Can you please help me with this?

Thanks in Advance

Srini

chekurti · Posted 09-12-2013 03:20 AM

Help please

chekurti · Posted 09-13-2013 08:04 AM

I tried using PROC TGPARSE to extract certain keywords from the text and was able to join them back with the raw data. But I am not able to control the text mining. For example, I am not able to use a stop list and a start list together. Is there also an option for passing in synonyms?

JustinPlumley · Posted 09-26-2013 11:46 AM

Hi,

The dataset exported from the Text Topic node includes the original (raw) data plus new variables associated with the topics. Simplifying slightly, the exported dataset will contain additional binary indicators (whether the record belongs to the topic or not) as well as raw scores (like a projection onto that topic; the values in the example below are mocked):

id  text                      product  _1_0_amber  _1_0_memory  _1_0_video  amber  memory  video
1   Amber blink continuously  x142     1           0            0           .9     .05     .08
2   Memory space problem      x189     0           1            0           .1     2.3     .07
3   video blurr               x902     0           0            1           .1     .1      1.1

As you can see, this includes the original (raw) dataset of 3 records, then has 3 additional binary columns (corresponding to the 3 topics) and 3 additional raw topic score columns (also corresponding to the 3 topics).
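If you want a single topic label per document (as in your desired output), one option is to collapse the binary indicators into one column. The snippet below is only a rough sketch in Python on the mocked rows above, not the node's own logic: the column names, the `_1_0_` prefix, and the rule "pick the flagged topic with the highest raw score" are assumptions for illustration, and `assign_topic` is a hypothetical helper. The same idea could be written as a DATA step against the real exported dataset.

```python
# Mocked rows copied from the example table; in practice these would come
# from the dataset exported by the Text Topic node.
rows = [
    {"id": 1, "text": "Amber blink continuously", "product": "x142",
     "_1_0_amber": 1, "_1_0_memory": 0, "_1_0_video": 0,
     "amber": 0.9, "memory": 0.05, "video": 0.08},
    {"id": 2, "text": "Memory space problem", "product": "x189",
     "_1_0_amber": 0, "_1_0_memory": 1, "_1_0_video": 0,
     "amber": 0.1, "memory": 2.3, "video": 0.07},
    {"id": 3, "text": "video blurr", "product": "x902",
     "_1_0_amber": 0, "_1_0_memory": 0, "_1_0_video": 1,
     "amber": 0.1, "memory": 0.1, "video": 1.1},
]

TOPICS = ["amber", "memory", "video"]  # assumed topic names


def assign_topic(row, topics=TOPICS, flag_prefix="_1_0_"):
    """Return the flagged topic with the highest raw score, or None.

    A document can be flagged for several topics (or none), so ties between
    flagged topics are broken by the raw score column.
    """
    flagged = [t for t in topics if row[flag_prefix + t] == 1]
    if not flagged:
        return None
    return max(flagged, key=lambda t: row[t])


# Keep the raw columns and add one "topic" column per document.
labeled = [
    {"id": r["id"], "text": r["text"], "product": r["product"],
     "topic": assign_topic(r)}
    for r in rows
]

for r in labeled:
    print(r["id"], r["text"], r["product"], r["topic"])
```

Keep in mind that a document may belong to more than one topic, or to none, so a single label column can lose information that the indicator columns preserve.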

At the moment, I am not sure why you would use both a START list (which says 'only use these terms') and a STOP list (which says 'do not use these terms') at the same time. I would simply remove from the START list any words that also occur in the STOP list, and then use this new START list. Would that help with what you are trying to accomplish?
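To make the idea concrete, here is a minimal sketch in Python of pruning a START list by a STOP list. The term lists are made up for illustration, `prune_start_list` is a hypothetical helper, and the case-insensitive comparison is my assumption; in practice both lists would be datasets with a term column, and the same filtering could be done there.

```python
def prune_start_list(start_terms, stop_terms):
    """Return start_terms without any term that also appears in stop_terms.

    Terms are compared case-insensitively after trimming whitespace; the
    original order and spelling of the surviving start terms are kept.
    """
    stop = {t.strip().lower() for t in stop_terms}
    return [t for t in start_terms if t.strip().lower() not in stop]


# Hypothetical lists, not taken from any real project.
start_list = ["amber", "memory", "video", "problem", "blink"]
stop_list = ["Problem", "the", "a"]

new_start_list = prune_start_list(start_list, stop_list)
print(new_start_list)  # ['amber', 'memory', 'video', 'blink']
```

The resulting list can then be used on its own as the START list, with no STOP list needed.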

As for importing synonyms, that is available in the Text Filter node.

Hope this helps!