Ullsokk
Pyrite | Level 9

I have a dataset that contains a list of about 8000 words, with a score for each word. I want to run through a dataset of up to 3-4 million rows and score each line based on the occurrence of these words. The text string to be searched is at most 32 characters long, but can contain several words. To do this I basically use a data step to write the code, put it to a file, and then include the file in another data step:

filename inc1 "C:\temp\inc1.txt";

data _null_;
file inc1;
set textscores;
string = "IF index(upcase(text),'"||strip(upcase(text))||"') > 0  then score = score + ("||strip(score)||");";
put string;
run;

data score;
set textdata;
score = 0;
%include inc1;
run;

I have tried it out on a list of only 100 words, and on a dataset of 2.7 million rows. This takes about 4 minutes to complete. If the run time increases linearly with the number of words, I am looking at a run time of about five hours. Ideally, I want to do this for 10-20 different score files, and on a dataset that might be double the size (maybe 7-8 million rows), in a batch that needs to complete in reasonable time.

 

Is there a smarter and faster way to get to this score? I can trim it down to only score the most relevant words, but if possible, I would just use the whole data set.

11 REPLIES
RW9
Diamond | Level 26

It is always a good idea to post test data in the form of a data step:
https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

 

And show what you want to get out at the end.

 

One simple way of speeding it up might be to use PRX (regular expressions) to find all the words:

http://support.sas.com/kb/38/719.html

 

So something like:

/* build a "|"-separated list of all words (and their scores) in macro variables */
data _null_;
  set textscores end=last;
  length txt scr $2000;
  retain txt scr;
  txt=catx("|",txt,text);
  scr=catx("|",scr,score);
  if last then do;
    call symputx("str",txt);
    call symputx("scr",scr);
  end;
run;

/* flag every row whose text matches any of the words in a single regex pass */
data score;
  set textdata;
  if prxmatch("m/&str./oi",charvar) > 0 then found=1;
run;

Note that this only sets a flag found, but it may be a quicker way of identifying records.

It really depends on your data; you could use the above to quickly flag records, subset the large dataset on that flag, and then process the smaller subset further, for example as in the sketch below.
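
For illustration, a minimal sketch of that two-pass idea, reusing the OP's generated inc1 file and the &str macro variable built above (everything else is an assumption about the data). Rows filtered out in pass 1 would simply score 0 and could be appended back afterwards if needed:

/* pass 1: cheap regex filter - keep only rows containing at least one score word */
data candidates;
  set textdata;
  if prxmatch("m/&str./oi", text) > 0;
run;

/* pass 2: full scoring with the generated code, but only on the smaller subset */
data score;
  set candidates;
  score = 0;
  %include inc1;
run;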

 

 

Kurt_Bremser
Super User

Load your lookup table into a hash object:

data score_values;
input word $ score;
cards;
AAA 5
BBB 3
CCC 4
;
run;

data have;
input string $32.;
cards;
first string with AAA
another with BBB and CCC
;
run;

data want;
set have;
length
  word $8
  score 8
;
if _n_ = 1
then do;
  /* load the word/score lookup table into memory once */
  declare hash sc(dataset:'work.score_values');
  sc.definekey('word');
  sc.definedata('score');
  sc.definedone();
  call missing(word,score);
end;
total_score = 0;
do i = 1 to countw(string);
  word = scan(string,i);
  /* find() returns 0 when the word exists in the hash */
  if not sc.find() then total_score + score;
end;
drop score word i;
run;

proc print data=want noobs;
run;

Result:

                            total_
         string              score

first string with AAA          5  
another with BBB and CCC       7  
andreas_lds
Jade | Level 19

Example data taken from @Kurt_Bremser's post. This solution follows Maxim 8: "There is a format for it." I don't think this one is faster than the solution using a hash object.

 

 

data score_values;
   input word $ score;
   cards;
AAA 5
BBB 3
CCC 4
;
run;

data score_format;
   /* build a CNTLIN dataset for a numeric informat (Type='I');
      the extra OTHER row maps words not on the list to 0 */
   set score_values(rename=(word=start score=label)) end=jobDone;
   length FmtName $ 32 Type HLO $ 1;
   retain FmtName 'ScoreFmt' Type 'I' HLO ' ';
   output;

   if jobDone then do;
      HLO = 'O';
      Label = 0;
      output;
   end;
run;

proc format cntlin=score_format;
run;

data have;
   input string $32.;
   cards;
first string with AAA
another with BBB and CCC
;
run;

data want;
   set have;
   length
      word $8
      total_score 8
   ;

   total_score = 0;

   do i = 1 to countw(string);
      word = scan(string, i);
      total_score = total_score + input(word, ScoreFmt.);
   end;

   drop i word;
run;
Kurt_Bremser
Super User

I'd say that the decision between formats and hash objects should be driven by personal preference and by what one is more familiar with.

Both avoid any sorting and joining and do the main work in one sequential pass through the big dataset, so the difference in performance should not be significant.

Determine relative performance in the good old empirical way (Maxim 4).
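
For instance, a minimal sketch of such an empirical check (the timing scaffolding here is just one of several possibilities; the step bodies are the ones from the posts above and are elided):

options fullstimer;   /* more detailed real/CPU/memory notes in the log */

%let t0 = %sysfunc(datetime());
data want_hash;
  set have;
  /* ... hash-object lookup from the post above ... */
run;
%put NOTE: hash step elapsed %sysevalf(%sysfunc(datetime()) - &t0) seconds;

%let t0 = %sysfunc(datetime());
data want_fmt;
  set have;
  /* ... informat lookup from the post above ... */
run;
%put NOTE: informat step elapsed %sysevalf(%sysfunc(datetime()) - &t0) seconds;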

s_lassen
Meteorite | Level 14

You have had several interesting suggestions with very elegant programming. But I do not think any of them is faster than your own approach.

 

I only see one small efficiency glitch: instead of calling the UPCASE function repeatedly, create a temporary variable with the upcased string:

filename inc1 "C:\temp\inc1.txt";

data _null_;
file inc1;
set textscores;
string = "IF index(_text_),'"||strip(upcase(text))||"') > 0  then score = score + ("||strip(score)||");";
put string;
run;

data score;
set textdata;
score = 0;
_text_=upcase(text);
drop _text_;
%include inc1;
run;
Astounding
PROC Star

With one important change needed ... INDEX is the wrong function.  You need to search for words, not strings.  If you search for "RAP", you don't want to find a match if your longer text string contains "RAPTURE" or "RAPACIOUS" (or many other possibilities).
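
One way to do that, sticking with the code-generation approach (not something tested here), is to generate the tests with FINDW instead of INDEX, using the upcased _text_ variable from the previous reply; FINDW matches whole words only, delimited by the characters in the third argument (the delimiter list below is an assumption):

data _null_;
  file inc1;
  set textscores;
  /* FINDW returns 0 unless the word occurs as a whole word */
  string = "IF findw(_text_,'"||strip(upcase(text))||"',' ,.-') > 0 then score = score + ("||strip(score)||");";
  put string;
run;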

ErikLund_Jensen
Rhodochrosite | Level 12

Hi Ullsokk

 

In my experience an SQL join works pretty fast, so here is a different approach. I don't know if it can compete with the hash look-up method in speed, but it is very simple, easy to understand and explain, and easy to optimize, e.g. by removing common and uninteresting words in step 2 to reduce the size of the join dataset.

 

/*----------------------- create test data ----------------------*/
* TEST score dataset;
data score_values;
	test = 'AAA'; score = 5; output;
	test = 'BBB'; score = 3; output;
	test = 'CCC'; score = 4; output;
run;

* TEST input data set - must contain ID-variable;
data test_strings;
length ID 8 str $32;
	ID = 1; str = 'He stays at AAA'; output;
	ID = 2; str = 'He drove from AAA to BBB'; output;
	ID = 3; str = 'He drove from CCC to DDD'; output;
	ID = 4; str = 'He moved to DDD'; output;
	ID = 5; str = 'He drove from AAA to CCC via BBB'; output;
	ID = 6; str = 'AAA and AAA and AAA'; output;
run;

/*------------------------- processing --------------------------*/
* 1. Split test_strings in words - keep only ID and word ;
data wteststrings (keep=ID word); set test_strings;
	length word $12;
	do i = 1 to countw(str);
		word = scan(str,i,' ');
		output;
	end;
run;

* 2. Join input and score datasets to get a score for each word;
proc sql;
	create table scores1 as
		select 
			a.ID,
			a.word,
			b.score
		from wteststrings as a
		left join score_values as b
		on a.word = b.test
		where b.score > 0;
quit;

* Summarize scores pr. ID;
proc sql;
	create table scores2 as
		select distinct 
			ID,
			sum(score) as score
		from scores1
		group by ID;
quit;

* Join sum back on input to include strings without any score;
proc sql;
	create table result as
		select 
			a.ID,
			a.str,
			max(b.score,0) as score
		from test_strings as a
		left join scores2 as b
		on a.ID = b.ID;
quit;
Kurt_Bremser
Super User

I wouldn't be so optimistic about doing SQL joins (or any joins).

They force SAS to do a sort on the data from the big dataset.

Both the format and the hash method build a search tree for the lookup in memory and do not need to sort the large dataset at all; as soon as the large dataset exceeds a certain size, you will see a noticeable performance difference, which can grow to several orders of magnitude with dataset size.
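
If in doubt, PROC SQL's _METHOD option shows the access plan it chooses (e.g. sqxsort for a sort, sqxjhsh for an in-memory hash join), so you can check whether the big dataset is actually being sorted. A minimal sketch, reusing the dataset names from the previous post:

proc sql _method;
  create table scores1 as
    select a.ID, a.word, b.score
    from wteststrings as a
    left join score_values as b
    on a.word = b.test;
quit;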

ErikLund_Jensen
Rhodochrosite | Level 12

Hi KurtBremser

 

After running a full-scale test, I still advocate the sql approach.

 

I made a score data set with 8000 different words, and a string data set with 6 million strings containing from 3 to 8 words, of which zero to eight were taken from the score list.
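
For reference, a minimal sketch of how test data of roughly that shape could be generated; every name and detail here is an assumption, since the actual generation step is not shown in the log below:

/* 8000 arbitrary 4-character "words", each with a score 1-10 */
data score_values;
  length word $4;
  do i = 1 to 8000;
    word = put(i, hex4.);
    score = ceil(ranuni(1)*10);
    output;
  end;
  drop i;
run;

/* 6 million strings of 3-8 words; roughly half the words come from the score list */
data test_strings;
  length str $40 w $4;
  do n = 1 to 6000000;
    str = '';
    do j = 1 to 3 + int(ranuni(2)*6);
      if ranuni(3) < 0.5 then w = put(ceil(ranuni(4)*8000), hex4.);
      else w = 'ZZZZ';
      str = catx(' ', str, w);
    end;
    output;
  end;
  drop n j w;
run;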

 

As you can see in the following log, the crucial part, joining the 24 million words from the "unfolded" strings with the score list, used about 12 CPU seconds and about 7 seconds of elapsed time, and the whole operation took less than 25 seconds of elapsed time.

 

The hash object approach is certainly more elegant in terms of coding, and it might even be a few seconds faster, but the brute-force way is a lot easier to explain and debug.

 

 

565   * 0. Add ID variable;
566   data wteststrings1; set test_strings;
567       ID = _N_;
568   run;

NOTE: There were 6000000 observations read from the data set WORK.TEST_STRINGS.
NOTE: The data set WORK.WTESTSTRINGS1 has 6000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.47 seconds
      cpu time            0.48 seconds


569
570   * 1. Split test_strings in words - keep only ID and word ;
571   data wteststrings2 (keep=ID word); set wteststrings1;
572       ID = _N_;
573       length word $4;
574       do i = 1 to countw(str);
575           word = scan(str,i,' ');
576           output;
577       end;
578   run;

NOTE: There were 6000000 observations read from the data set WORK.WTESTSTRINGS1.
NOTE: The data set WORK.WTESTSTRINGS2 has 24114600 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1.99 seconds
      cpu time            2.00 seconds


579
580   * 2. Join input and score datasets to get a score for each word;
581   proc sql;
582       create table scores1 as
583           select
584               a.ID,
585               a.word,
586               b.score
587           from wteststrings2 as a
588           left join score_values as b
589           on a.word = b.word
590           where b.score > 0;
NOTE: Table WORK.SCORES1 created, with 13044600 rows and 3 columns.

591   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.93 seconds
      cpu time            12.18 seconds


592
593   * Summarize scores pr. ID;
594   proc sql;
595       create table scores2 as
596           select distinct
597               ID,
598               sum(score) as score
599           from scores1
600           group by ID;
NOTE: Table WORK.SCORES2 created, with 4800000 rows and 2 columns.

601   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.28 seconds
      cpu time            6.30 seconds


602
603   * Join sum back on input to include strings without any score;
604   proc sql;
605       create table result as
606           select
607               a.ID,
608               a.str,
609               max(b.score,0) as score
610           from wteststrings1 as a left join scores2 as b
611           on a.ID = b.ID;
NOTE: Table WORK.RESULT created, with 6000000 rows and 3 columns.

612   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           3.00 seconds
      cpu time            3.86 seconds
Kurt_Bremser
Super User

I just ran a comparison test (2 POWER8 cores, heavy concurrent load on the server because of year-end processing).

I increased my initial dataset size by multiplying the observations:

data have;
input string $32.;
do i = 1 to 100000;
  output;
end;
drop i;
cards;
first string with AAA
another with BBB and CCC
;
run;

Your method needed 7.5 seconds real time, the hash step 1.5.

 

Bottom line: if one does not have the luxury of a large multi-core server (your real vs. CPU time numbers point to one) with the corresponding amount of RAM (which probably allows the whole dataset to be kept in memory), performance considerations will gain a lot of weight.

 

While I totally concur with the "brute force" philosophy (see my Maxim 29), I also hold that learning the workings of the hash object is a must when dealing with large data and complex relationships nowadays. There's a very good book from SAS that I recommend: SAS Hash Object Programming Made Easy.

s_lassen
Meteorite | Level 14

What if you have more than one occurrence of a given term in your string? If you want to sum the scores for all occurrences, and you want to check for whole words only, something like this may be a possibility:

filename tempsas temp;

/* write one informat entry per score word, e.g.  'AAA'=5 */
data _null_;
  set textscores;
  file tempsas;
  put "'" text $upcase. "'=" score;
run;

/* compile the generated entries into a numeric informat;
   UPCASE matches raw values against the upcased entries, unknown words score 0 */
proc format;
  invalue score (upcase)
%include tempsas;
  other=0;
run;

data score;
  set textdata;
  start=1;
  pos=0;
  len=0;
  score=0;
  do until(0);
    /* locate the next word in the remainder of the string */
    call scan(substr(text,start),1,pos,len);
    if pos=0 then leave;
    start=start+pos-1;
    /* look the word up in the generated informat and add its score */
    score=score+input(substr(text,start,len),score.);
    start=start+len;
    if start>=vlength(text) then leave;
  end;
run;
  

I used CALL SCAN on a substring, because that seems to work a lot faster than using the SCAN function with an increasing index.
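
For comparison, the simpler variant with SCAN and an increasing index (which, per the above, is likely slower because each call has to rescan the string from the start) would look roughly like this, reusing the score. informat and the textdata names from this thread:

data score_alt;
  set textdata;
  score = 0;
  do i = 1 to countw(text);
    /* look each word up in the generated informat */
    score = score + input(scan(text, i), score.);
  end;
  drop i;
run;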
