AlexeyS
Pyrite | Level 9

Hello, I have a general question about text mining analytics.

I want to understand how to work with large text data. My aim is to extract topics from the text (for example, with a PLSA model).

Assume I have a dataset of more than 1 million rows and two variables: document id and story (every value is text, possibly more than 15,000 words).

I need to extract the words from every document. When I do this, I get a dataset with more than 500 million rows.

Example:

documentid   word
1            i
1            am
1            work
2            do
2            you
2            work
2            in

What should I do? Should I build an array instead (although it could still be very long)? How can I do text mining in SAS? What is the right way to structure the data?

Thank you

SASKiwi
PROC Star

This is not really a big data problem. Even a PC can easily text mine the volume of data you have.

 

I suggest you start by trying a few very simple searches first to see what the performance is like and to get the word searching working correctly:

 

data want;
  set have;
  /* Keep only rows whose story contains the whole word PLSA (case-insensitive) */
  if indexw(upcase(story), 'PLSA');
run;

 

Once your program is finding single words OK, then you can enhance it to do a list of words.
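
As a rough sketch of that next step, the same DATA step could loop over a small list of words. The three words below are made up for illustration, and the data set and variable names (have, story) are simply the ones assumed earlier in this thread:

```sas
/* Sketch only: keep stories that contain at least one word from a list. */
/* The word list is hypothetical; replace it with your own terms.        */
data want;
  set have;
  array words[3] $ 20 _temporary_ ('TOPIC' 'MODEL' 'MINING');
  found = 0;
  do i = 1 to dim(words);
    /* INDEXW matches whole words; here only blanks count as delimiters */
    if indexw(upcase(story), strip(words[i])) > 0 then do;
      found = 1;
      leave;   /* stop at the first hit */
    end;
  end;
  if found;    /* subsetting IF: output only matching rows */
  drop i found;
run;
```

Because only blanks are treated as delimiters here, a word followed directly by punctuation (for example "model,") would not match. Passing a delimiter list as the third argument of INDEXW is one way to handle that.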

AlexeyS
Pyrite | Level 9

Thank you for your answer. I did what you suggested. I am doing my capstone on SAS University Edition, and I saw that 1,000,000 rows is a lot for it. In your opinion, is the technique I used, building a word list with a frequency for every document as a column, the right one?

AlexeyS
Pyrite | Level 9

Maybe you didn't understand me: I am not looking for the word PLSA. PLSA is a model for topic mining. For every story I extract all the words as a column vector and label each word with the document number, then I go to the next row and do the same. The total number of rows is about 500 million, which is the total number of words. Then I move on to the topic mining algorithm and simulation.
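
To make the step described above concrete, here is a minimal sketch of the word extraction, followed by one possible way to shrink the result: storing one row per document/word pair with a count instead of one row per word occurrence. This is an illustration under assumptions, not a tested solution for data of this size. The names have, documentid and story are assumed from the question; words, term_freq, freq and the $40 word length are hypothetical choices.

```sas
/* Step 1: one row per word occurrence (the long table described above). */
/* Words are split on blanks only, so punctuation stays attached.        */
data words (keep=documentid word);
  set have;
  length word $ 40;              /* assumed maximum word length */
  n = countw(story, ' ');
  do i = 1 to n;
    word = lowcase(scan(story, i, ' '));
    output;
  end;
run;

/* Step 2: collapse to one row per (documentid, word) with a frequency. */
proc sql;
  create table term_freq as
  select documentid, word, count(*) as freq
  from words
  group by documentid, word;
quit;
```

Since words repeat within a story, term_freq should normally have fewer rows than words, and this sparse (document, word, count) layout is a common input format for topic models. Note that calling SCAN with an increasing index rescans the string each time, so for very long stories a different tokenizing loop may be faster.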

SASKiwi
PROC Star

OK, I misunderstood and thought you were doing simple word searches. What you want sounds like it needs much more advanced techniques. Why not check out what SAS Text Miner can do? If it can do what you want, then why build it yourself?

AlexeyS
Pyrite | Level 9

Hi, yes, this is an advanced technique. I can't use it because the aim is to develop the models myself.

I really don't understand how I can do it in SAS. Maybe SAS is not a suitable platform for this?

SASKiwi
PROC Star

Perhaps this link may be useful to you:

 

http://blogs.sas.com/content/sasdummy/2010/05/27/a-topical-topic-how-text-mining-determines-topics/

 

I also suggest you search for other text mining resources yourself on the SAS Support site.

 

 
