Text mining and content categorization

text mining - big data

Contributor
Posts: 58


Hello, I have a general question about text mining analytics.

I want to understand how to work with big text data. My aim is to extract topics from the text (for example, with a PLSA model).

Assume I have data with more than 1 million rows and two variables: document ID and story (every value is text, possibly more than 15,000 words).

I need to extract the words from every document. When I do that, I get a dataset with more than 500 million rows.

Example:

documentid   word
1            i
1            am
1            work
2            do
2            you
2            work
2            in

 

What should I do? Maybe build an array (but it could still be very long)? How can I do text mining in SAS? What is the right way to structure the data?

Thank you

Respected Advisor
Posts: 2,989

Re: text mining - big data

This is not really a big data problem. Even a PC can easily text mine the volume of data you have.

 

I suggest you start by trying a few very simple searches first to see what the performance is like and to get the word searching working correctly:

 

data want;
  set have;
  /* Subsetting IF: keep only rows whose story contains the whole word PLSA (case-insensitive) */
  if indexw(upcase(story), 'PLSA');
run;

 

Once your program is finding single words OK, then you can enhance it to do a list of words.
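
As a rough illustration of that next step, here is a minimal sketch that extends the single-word filter to a short list of words. The three search terms and the 20-character term length are hypothetical placeholders, and the dataset and variable names (have, want, story) are the same assumed names used above.

```sas
/* Sketch: keep stories that contain at least one word from a small list. */
data want;
  set have;                                  /* assumed to contain the variable story */
  /* Hypothetical search terms; replace with your own list */
  array terms[3] $ 20 _temporary_ ('TOPIC', 'MODEL', 'MINING');
  found = 0;
  do i = 1 to dim(terms) while (not found);
    /* INDEXW returns the position of a whole-word match, or 0 if there is none */
    if indexw(upcase(story), strip(terms[i])) > 0 then found = 1;
  end;
  if found;                                  /* subsetting IF: output matching rows only */
  drop i found;
run;
```

With the default blank delimiter, a word followed directly by punctuation (for example "model.") will not match, so you may want to clean the text first.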

Contributor
Posts: 58

Re: text mining - big data

Thank you for your answer. I did what you told me. I am doing my capstone on SAS University Edition, and I saw that 1,000,000 rows is a large number for it. In your opinion, is the technique I used, building a word list with a frequency for every document as a column, the right one?

Contributor
Posts: 58

Re: text mining - big data

Maybe you didn't understand me. I am not looking for the word PLSA; PLSA is a model for topic mining. For every story, I extract all the words as a column vector and label each one with the document number, then I go to the next row and do the same. The total number of rows is about 500 million, which is the total number of words. Then I move on to the topic mining algorithm and simulation.
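
A minimal sketch of this extraction step, under some assumptions: the input dataset is called have with the variables documentid and story, words are separated by blanks, and a word is at most 40 characters long (longer tokens would be truncated). Defining the token step as a DATA step view means the one-row-per-word table is not written to disk, and storing one row per document and distinct word with a count can never need more rows than one row per word occurrence.

```sas
/* Sketch: tokenize each story without materializing the one-row-per-word table. */
data tokens / view=tokens;
  set have;                          /* assumed variables: documentid, story */
  length word $ 40;                  /* assumed maximum word length */
  do i = 1 to countw(story, ' ');    /* number of blank-delimited words */
    word = lowcase(scan(story, i, ' '));
    output;                          /* one row per word occurrence */
  end;
  keep documentid word;
run;

/* Collapse to one row per document and distinct word, with its frequency. */
proc sql;
  create table term_counts as
  select documentid, word, count(*) as freq
  from tokens
  group by documentid, word;
quit;
```

The term_counts table is a sparse document-term representation, which is the usual starting point for a topic model such as PLSA.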

Respected Advisor
Posts: 2,989

Re: text mining - big data

OK, I misunderstood and thought you were doing simple word searches. What you want sounds like it needs much more advanced techniques. Why not check out what SAS Text Miner can do? If it can do what you want, then why build it yourself?

Contributor
Posts: 58

Re: text mining - big data

Hi, yes, this is an advanced technique. I can't use it because the aim is to develop the models myself.

I really don't understand how I can do it in SAS. Maybe SAS is not a suitable platform for this?

Respected Advisor
Posts: 2,989

Re: text mining - big data

Perhaps this link may be useful to you:

 

http://blogs.sas.com/content/sasdummy/2010/05/27/a-topical-topic-how-text-mining-determines-topics/

 

I also suggest you search for other text mining resources yourself on the SAS Support site.

 

 
