Here is a problem submitted to the sagacity of this forum's users at https://communities.sas.com/t5/Base-SAS-Programming/Big-Data-Cartesian/td-p/288303 .
The participants’ attempts at improvements and their findings are summarised here as an example of improving a slow process. Special thanks to @JohnHoughton and @KurtBremser for their contributions.
The goal is to find trademark brands in a large amount of text. The brands and the text are in two tables, with respectively 25 million brands and 400,000 phrases. The SAS user seeking help states that the process takes one month (!) to complete using a standard join using proc SQL.
proc sql; create table FINAL as select * from PHRASES p inner join BRANDS b on trim(p.PHRASE) contains trim(b.BRAND); quit;
The first phase is to build sample data in order to test various algorithms without having to wait one month per test. Building test data is also necessary since the real data is of course proprietary.
Here is the sample data that we’ll use, with 8,000 brands and 2,000 phrases:
%* Create brand sample; data BRANDS (keep=BRAND); length BRAND $ 60; %* Add a few usable brand name strings; BRAND="Apple" ; output; BRAND="Apples" ; output; BRAND="ROBINSONS Fruit Shoot Apple & Blackcurrant" ; output; BRAND="Coca-Cola" ; output; BRAND="Head & Shoulders" ; output; do I=0 to 2e3; %* Brand iterations, is multiplied by J; BRAND=''; do J=1 to 4; %* Make 1-word to 4-word unique brands; BRAND=catx(' ', BRAND, put(ranuni(0),14.12)); output; end; end; run; %* Create phrase sample; data PHRASES (keep=PHRASE PHRASE_ID); length PHRASE $ 2000; do PHRASE_ID=1 to 2e3; %* Nb of phrases; PHRASE=''; do J=1 to 25; %* Nb of words per phrase; PHRASE=catx(' ', PHRASE, put(ranuni(0)*10,14.12) ); if ranuni(0)>.95 then do; %* Sprinkle one or more usable brand; POINT=int(ranuni(0)*5+1); set BRANDS point=POINT; PHRASE=catx(' ', PHRASE, BRAND); end; end; output; end; stop; run;
The first question was to benchmark if the contains operator could be replaced with a function such as index() or find().
The findings are that function index(trim()) has similar performance to the SQL operator contains, while function find() is slower regardless whether values are trimmed using the trim() function or the ‘t’ parameter.
The second finding is that the process is CPU-bound. We know this because the CPU time is similar to or greater than the real time. With the sample data, both times when running the default SQL match are 22 seconds on out server. Since the process is CPU-bound, there is so no point trying to optimise I/Os through better use of RAM or buffers. The matching method itself must be changed to something more efficient.
Different methods for matching the strings are tried. One is to use arrays in a data step.
data OUT_DS_ARRAY; array BRANDS  $60 _temporary_; if _N_=1 then do K=1 to 8009; set BRANDS; BRANDS[K]=BRAND; end; set PHRASES; do K=1 to 8009; if index(trim(PHRASE), trim(BRANDS[K])) then do; BRAND=BRANDS[K]; output; end; end; drop K; run;
This provides an improvement with CPU and real times of 12 seconds.
Another method is to point to each BRAND record directly in the BRANDS table.
sasfile BRANDS load; data OUT_DS_POINT; set PHRASES; do K=1 to 8009; set BRANDS point=K; if index(trim(PHRASE), trim(BRAND)) then output; end; drop K; run; sasfile BRANDS close;
This is slower at around 28 seconds regardless whether sasfile is used to load the data in memory.
The next test uses a hash table.
data OUT_DS_HASH; set PHRASES; if _N_=1 then do; dcl hash BRANDS(dataset:'BRANDS'); BRANDS.definekey('BRAND'); BRANDS.definedata('BRAND'); BRANDS.definedone(); declare hiter IT('BRANDS'); if 0 then set BRANDS; end; RC=IT.first(); do while (RC = 0); if index(trim(PHRASE), trim(BRAND)) then output; RC=IT.next(); end; drop RC; run;
This provides good times too with CPU and real times of 13 seconds.
Another effort, though not very scalable, involves reading values from macro variables instead of reading them from an array. This doesn't best other methods with the test data used here.
data _null_; set PHRASES; call symputx(cat('PHRASE_',_N_),PHRASE); run; options noquotelenmax; data OUT_DS_MACRO; set BRANDS; length PHRASE $2000; %macro loop; %local i; %do i=1 %to 2000; if index("&&phrase_&i", trim(BRAND)) then do; PHRASE="&&phrase_&i"; output; end; %end; %mend; %loop; run;
This yields CPU and real time of 17 seconds.
Note that in the actual original thread, macro variables were fastest as the test data was slightly different.
As usual, only one’s actual data on one’s actual hardware will result in the most accurate benchmark.
So far, we have managed to improve the original run time, but not that much. If the improvement scales in a proportional manner using the original data, we can still expect two weeks of processing to perform the full table join.
A different approach is needed. One approach is to change the match criterion to a simple (and fast) equality rather than on using function index(), which implies splitting the phrases into all the possible brands they could contain. This is what the following example does:
1 - Derive the maximum number of words that can make up a brand.
2 - Split the phrases into all combinations of successive words, up to the maximum number of words that a brand can include. This will fabricate all fake possible brands in a phrase.
3 - Match this multitude of phrase fragment to a brand with a simple equality match.
%let time=%sysfunc(time()); %* Find the maximum word count in brand names; proc sql noprint; select max(countw(BRAND, ' ')) into :max_brand_word_count from BRANDS; quit; %put max_brand_word_count=&max_brand_word_count; %* Extract consecutive phrase words to build possible brands from the phrases; data POSSIBLE_BRAND_STRINGS (keep= PHRASE_ID POSSIBLE_BRAND ); set PHRASES; length POSSIBLE_BRAND $200; PHRASE_WORD_NB=countw(PHRASE, ' '); do WORD_NB_IN_PHRASE=0 to PHRASE_WORD_NB-1; POSSIBLE_BRAND=''; do NB_WORDS_IN_FAKE_BRAND=1 to &max_brand_word_count.; POSSIBLE_BRAND=catx(' ', POSSIBLE_BRAND, scan(PHRASE, WORD_NB_IN_PHRASE+NB_WORDS_IN_FAKE_BRAND, ' ')); if POSSIBLE_BRAND ne lag(POSSIBLE_BRAND) then output; end; end; run; %* Match possible brands to actual brands and retrieve full phrase; proc sql _method; create table OUT_SQL_BREAK_DOWN as select unique ph.PHRASE_ID, ph.PHRASE, br.BRAND from BRANDS br inner join POSSIBLE_BRAND_STRINGS pb on br.BRAND = pb.POSSIBLE_BRAND left join PHRASES ph on pb.PHRASE_ID = ph.PHRASE_ID order by ph.PHRASE_ID; quit; %put time=%sysevalf(%sysfunc(time())-&time);
This brings the time down to just one second. This is hugely better. We can now expect a reasonable running time on the full set of data.
If we look at the match count, we realise that there are fewer matches with this method than with the others. Could it be that it is making mistakes? Quite the opposite in fact. For example, phrase Apples no longer matches brand Apple. The original matching method was flawed, and this one is more accurate.
We still make erroneous matches though. If the phrase contains "ROBINSONS Fruit Shoot Apple & Blackcurrant", one of the fake brands is "Apple", which will match brand Apple. We need to remove these false matches. This can be done by checking the matches in a post-processing step to ensure that some matches are not embedded in larger matches. This final piece of code does that. It looks for the position of the matches, and if some matches are inside other matches they are removed.
%* Extract consecutive phrase words to build possible brands from the phrases. ; %* Keep position of matched brand in phrase; data POSSIBLE_BRAND_STRINGS (keep= PHRASE_ID POSSIBLE_BRAND START END); set PHRASES; length POSSIBLE_BRAND $200; PHRASE_WORD_NB=countw(PHRASE, ' '); START=0; do WORD_NB_IN_PHRASE=0 to PHRASE_WORD_NB-1; POSSIBLE_BRAND=''; do NB_WORDS_IN_FAKE_BRAND=1 to &max_brand_word_count.; POSSIBLE_BRAND=catx(' ', POSSIBLE_BRAND, scan(PHRASE, WORD_NB_IN_PHRASE+NB_WORDS_IN_FAKE_BRAND, ' ')); END=START+length(POSSIBLE_BRAND); if POSSIBLE_BRAND ne lag(POSSIBLE_BRAND) then output; end; START+1+length(scan(PHRASE, WORD_NB_IN_PHRASE+1,' ')); end; run; %* Match possible brands to actual brands and retrieve full phrase; proc sql _method; create table OUT_SQL_BREAK_DOWN as select ph.PHRASE_ID, ph.PHRASE, br.BRAND, pb.START, pb.END from BRANDS br inner join POSSIBLE_BRAND_STRINGS pb on br.BRAND = pb.POSSIBLE_BRAND left join PHRASES ph on pb.PHRASE_ID = ph.PHRASE_ID order by ph.PHRASE_ID, BRAND, START; quit; %* Flag embedded brands; proc sql; create table EMBEDDED_BRAND as select unique a.PHRASE_ID, a.BRAND, a.START from OUT_SQL_BREAK_DOWN a , OUT_SQL_BREAK_DOWN b where a.PHRASE_ID eq b.PHRASE_ID and b.START <= a.START and a.END <= b.END and a.BRAND ne b.BRAND order by a.PHRASE_ID, BRAND, START; quit; %* Final output: valid matches only; data OUT_SQL_BREAK_DOWN2; merge OUT_SQL_BREAK_DOWN EMBEDDED_BRAND (in=EMBEDDED_BRAND); by PHRASE_ID BRAND START; if not EMBEDDED_BRAND; run;
Here we go! We have divided the processing time by well over an order of magnitude while improving the process’ accuracy.
The final solution does not use a standard joining method. Because the join is atypical, we had to find an atypical way to join the tables.
I hope this article will be useful and will trigger brain waves so you keep looking for solutions when a process doesn't satisfy your requirements. Good performance is a key component of what makes a quality SAS process.
If you liked this article, you will love this book:
It is dedicated to improving the speed of SAS processes and it is full of performance tips, explanations and benchmarks.