09-10-2015 02:25 AM
Hello, I have a general question about text mining analytics.
I want to understand how to work with large text data. My aim is to extract topics from the text (for example, with a PLSA model).
Assume I have a dataset of more than 1 million rows and two variables: document ID and story (every value is text, possibly more than 15,000 words).
I need to extract the words from every document. When I do this, I get a dataset with more than 500 million rows.
What should I do? Maybe build an array (but it could still be very long)? How can I do text mining in SAS? What is the right way to structure the data?
09-10-2015 03:15 AM
This is not really a big data problem. Even a PC can easily text mine the volume of data you have.
I suggest you start by trying a few very simple searches first to see what the performance is like and to get the word searching working correctly:
if indexw(upcase(story), 'PLSA');
Once your program is finding single words OK, then you can enhance it to do a list of words.
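As a rough sketch of what a word-list version of that search could look like: the dataset WORK.STORIES, its two sample rows, and the three search terms below are made-up placeholders, not anything from the original data. The idea is to loop over a small temporary array of terms and keep a row as soon as one of them is found as a whole word.

```sas
/* Hypothetical stand-in for the real table (doc_id and story are assumed names). */
data work.stories;
  length doc_id 8 story $200;
  doc_id = 1; story = 'This story is about topic mining.'; output;
  doc_id = 2; story = 'Nothing relevant here.'; output;
run;

data work.matches;
  set work.stories;
  /* Search terms are illustrative only; replace with your own list. */
  array terms[3] $20 _temporary_ ('TOPIC' 'MODEL' 'MINING');
  /* Turn common punctuation into blanks so a word followed by a period
     or comma still counts as a whole word for INDEXW. */
  clean = upcase(translate(story, ' ', '.,;:!?'));
  found = 0;
  do i = 1 to dim(terms) while (found = 0);
    /* INDEXW returns the position of the word, or 0 if it is absent. */
    if indexw(clean, strip(terms[i])) > 0 then found = 1;
  end;
  if found;   /* keep only the stories that contain at least one term */
  drop i found clean;
run;
```

The punctuation list in TRANSLATE is only a starting point; real text will probably need more characters added to it.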
09-10-2015 12:11 PM
Thank you for your answer. I did what you told me. I am doing my capstone on SAS University Edition, and I found that 1,000,000 rows is a lot for it. In your opinion, is the technique I used, building a word list with a frequency for every document as a column, the right one?
09-10-2015 12:15 PM
Maybe you didn't understand me. I am not looking for the word PLSA; PLSA is a model for topic mining. For every story, I extract all the words as a column vector and label each with the document number, then I go to the next row and do the same. The total number of rows is about 500 million, which is the total number of words. Then I move on to the topic mining algorithm and simulation.
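One way the extraction step described above might be sketched in a DATA step is to count words inside each document as they are scanned, so the output holds one row per distinct word per document instead of one row per token. This is only an illustration under assumptions: WORK.STORIES and its sample rows are placeholders, words are split on blanks and punctuation, and tokens longer than 40 characters are truncated.

```sas
/* Hypothetical stand-in for the real table (doc_id and story are assumed names). */
data work.stories;
  length doc_id 8 story $200;
  doc_id = 1; story = 'The cat sat on the mat. The cat slept.'; output;
  doc_id = 2; story = 'Dogs chase the cat.'; output;
run;

data work.doc_term(keep=doc_id word count);
  length word $40 count 8;
  if _n_ = 1 then do;
    /* Hash keyed by word; it holds the running count for the current document. */
    declare hash h();
    h.defineKey('word');
    h.defineData('word', 'count');
    h.defineDone();
    declare hiter it('h');
  end;
  set work.stories;
  h.clear();                              /* start each document with an empty table */
  do i = 1 to countw(story, ' ', 'sp');   /* 's' and 'p' add whitespace and punctuation as delimiters */
    word = lowcase(scan(story, i, ' ', 'sp'));
    if h.find() = 0 then do;              /* word already seen in this document */
      count = count + 1;
      h.replace();
    end;
    else do;                              /* first occurrence in this document */
      count = 1;
      h.add();
    end;
  end;
  /* Write one row per distinct word, then move on to the next story. */
  rc = it.first();
  do while (rc = 0);
    output;
    rc = it.next();
  end;
run;
```

The resulting long table (doc_id, word, count) is a sparse document-term representation, which is the usual input shape for topic models such as PLSA. It can never have more rows than the one-row-per-token layout, and for long stories with many repeated words it should be considerably smaller.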
09-10-2015 08:34 PM
OK, I misunderstood and thought you were doing simple word searches. What you want sounds like it needs much more advanced techniques. Why not check out what SAS Text Miner can do? If it can do what you want, then why build it yourself?
09-11-2015 12:44 AM
Hi, yes, this is an advanced technique. I can't use SAS Text Miner because the aim is to develop the models myself.
I really don't understand how I can do this in SAS. Maybe SAS is not a suitable platform for this?
09-15-2015 12:03 AM
Perhaps this link may be useful to you:
I also suggest you search for other text mining resources yourself on the SAS Support site.