Hi. I wasn't sure if @ChrisNZ hash object solution has been tested on datasets as big as the original post specified, so here is another option. This solution runs in less than 7 seconds for sample data with 100 thousand individual brands and 300 phrases of around 25 words each. With the dataset sizes mentioned in the original post (25 million individual brands and 400 thousand phrases), it ran in under 20 minutes, using 5 word phrases. I am using SAS OnDemand for acedemics , so not sure what slice of the cpu I get. It is similar to @Kurt_Bremser suggestion, but can handle brand names with multiple words. The sample "BRANDS" data table contains brand names with between 1 & 5 words, all beginning with 'b' (so they can be easily spotted in the phrases). I've included some familiar brands too. The sample "PHRASES" data table is built using a subset of the "BRANDS" table. 1 in 3 phrases contains a brand. Each phrase contains a maximum of one brand. The number of words in each phrase can be adjusted using the "do i=" loop To avoid confusing brand 'b1 b2 b3' with brand 'b1 b2' or brand 'b1', the program searches for brands with the longest word count first (that is 5 words in the examples). It splits the phrases up in to 5 word portions, and merges with the 5 word brand names. It then deletes the found brand names from the phrases. Then it looks for brands with 4 words, deletes those from the phrases , 3 words and so on. First building the sample data tables *create a sample brands data table with brand names of between 1 & 5 words;
data brands (drop=i j);
length brand $ 60;
brand="Apple";output;
brand="Apples";output;
brand="Applebee";output;
brand="Coca-Cola";output;
brand="Head & Shoulders";output;
do i=0 to 2E4-1 by 1; *number of brands in data table will be 5 times this number;
*change to 5E6 to get 25million individual brandnames;
brand='';
do j=1 to 5;
brand=strip(brand)||' b'||strip(put(i+j, 10.));
brand=strip(brand);
output;
end;
end;
run;
*create the sample phrases data table with rows defined by the obs= option;
*1 in 3 phrases wil contain a brand name;
*phrases have 5 words + a brand name;
data phrases (drop=i brand) ;
set brands (keep=brand obs=300);
length phrase $ 2000;
phrase='';
do i=1 to 5;*use this to control the number of words in each phrase;
phrase=catx('', phrase, put(floor((1E8)*Rand("Uniform")), 12.));
if i=3 and (_N_<=5 or mod(_N_,3)=0) then
phrase=catx('', phrase, brand);
end;
output;
run; Then the program *create a column to hold a copy of the original phrase, which will be editted; data phrases_2; set phrases; length phrasecopy $ 2000; phrasecopy=phrase; phraseid=_N_; run; *add a brand word count to the brand data set; data brands_2; set brands; brandwc=countw(brand, ' '); run; *Need to know the maximum word count in brand name; proc sql noprint; select max(brandwc) into :brandmaxwc from brands_2; quit; %put &brandmaxwc; /*delete any existing 'matches' datatable*/ proc datasets lib=work nolist; delete matches; quit; %macro repeat; %do phrasepartwords=&brandmaxwc %to 1 %by -1; *Then split phrases into word portions with &phrasepartwords words; data phrasesplit (drop=i j wrds phrasewc ); set phrases_2 (drop=phrase); length partphrase $ 200; phrasewc=countw(phrasecopy, ' '); wrds=&phrasepartwords; do i=0 to phrasewc-wrds; partphrase=''; do j=1 to wrds; partphrase=catx(' ', partphrase, scan(phrasecopy, i+j, ' ')); end; output; end; run; proc sql ; create table temp as select p.phrasecopy, p.phraseid,b.brand from phrasesplit p inner join brands_2 (where =(brandwc=&phrasepartwords)) b on p.partphrase=b.brand order by phraseid; run; quit; data temp; set temp; phrasecopy=transtrn(phrasecopy, strip(brand), ''); run; proc append base=matches data=temp (drop=phrasecopy); *Update phrase_2 data with the matches. This removes the found brand name from phrasecopy, ready for the next iteration; data phrases_2 ; merge phrases_2 temp (drop=brand); by phraseid; run; %end; %mend; %repeat; proc sort data=matches; by phraseid; run; data want; merge phrases_2 (drop=phrasecopy) matches (in=a); by phraseid; if a ; run; I am running on SAS OnDemand so benchmarking is awkward. Try running with the sample data sets as they are and hopefully it will complete in less than 7 seconds as it did for me (running it on SAS OnDemand). Then change the BRANDS loop to 5E5 ( to get 25 million brands), and the PHRASES obs= option to obs=4E5 ( to get 400 thousand phrases), and I hope this will complete in less than 10 minutes. Then can try increasing the number of words in the phrases, or try with a subset of your real world data If this works for you, then with extra effort it can be extended to look for more than one brand in a phrase. Add another loop within the current loop that repeats until the number of observations in data table 'TEMP' is zero. Edited Sunday 8th Aug - correction, the code already works for more than one brand in a phrase. Edited Friday 5th Aug Just ran (using Sas OnDemand) for 25 million brands, 400 thousand phrases of 25 words + brandnames. Took about 90 minutes real-time ! Log file attached.
... View more