I have a dataset that contains a list of about 8000 words, with a score for each word. I want to run through a dataset of up to 3-4 million rows and score each line based on the occurrence of these words. The text string to be searched is at most 32 characters long, but can contain several words. To do this I basically use a data step to write the scoring code, put it to a file, and then %include that file in another data step:

filename inc1 "C:\temp\inc1.txt";
data _null_;
    file inc1;
    set textscores;
    /* generate one IF statement per word/score pair */
    string = "IF index(upcase(text),'" || strip(upcase(text)) ||
             "') > 0 then score = score + (" || strip(put(score, best32.)) || ");";
    put string;
run;
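For a hypothetical word 'GREAT' with score 2, inc1.txt would then contain a line like:

IF index(upcase(text),'GREAT') > 0 then score = score + (2);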
data score;
    set textdata;
    score = 0;
    %include inc1;
run;

I have tried it out on a list of only 100 words against a dataset of 2.7 million rows, and it takes about 4 minutes to complete. If the runtime grows linearly with the number of words, the full 8000-word list would take roughly five hours. Ideally, I want to do this for 10-20 different score files, on a dataset that might be double the size (maybe 7-8 million rows), in a batch job that needs to complete in a reasonable time. Is there a smarter and faster way to get to this score? I can trim the list down to only the most relevant words, but if possible I would rather use the whole dataset.
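One alternative I have been wondering about, but have not tested: load the score file into a hash object once and look up each word of the string, instead of generating 8000 IF statements. A rough sketch follows; it assumes textscores has the variables text (the word) and score, that the words are stored in uppercase (the generator above upcases them anyway), and that a whole-word match is acceptable, since unlike index() it will not match substrings:

data score2;
    /* load the word/score pairs into an in-memory hash table once */
    if _n_ = 1 then do;
        length word $32 wscore 8;
        declare hash h(dataset: "textscores(rename=(text=word score=wscore))");
        h.defineKey("word");
        h.defineData("wscore");
        h.defineDone();
        call missing(word, wscore);
    end;
    set textdata;
    score = 0;
    /* look up each blank-delimited word of the 32-character string */
    do i = 1 to countw(text, " ");
        word = upcase(scan(text, i, " "));
        if h.find() = 0 then score = score + wscore;
    end;
    drop i word wscore;
run;

If substring matches matter, this is not a drop-in replacement, but a single hash lookup per word should scale much better than 8000 index() calls per row.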