ChrisNZ
Tourmaline | Level 20

Haha! Good man! On to fixing this! I shouldn't try to think so late!!

ChrisNZ
Tourmaline | Level 20

The hash table doesn't shine so much now that I test properly.

Back to square one....

 

 

 

227  options fullstimer;
228
229  data PHRASES;
230    length PHRASE $2000;
231    PHRASE=repeat('x',1000);
232    do I=1 to 300; output; end ;
233   run;

NOTE: The data set WORK.PHRASES has 300 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              186.10k
      OS Memory           10864.00k
      Timestamp            1/08/2016 11:16:49 PM


234
235  data BRANDS;
236    length BRAND $40;
237    BRAND=repeat('x',25); output;
238    do J=1 to 1e5; BRAND=put(J,20.);output; end ;
239  run;

NOTE: The data set WORK.BRANDS has 100001 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      user cpu time       0.01 seconds
      system cpu time     0.01 seconds
      memory              184.37k
      OS Memory           10864.00k
      Timestamp            1/08/2016 11:16:49 PM


240
241  proc sql;
242    create table OUT_SQL_CONTAINS as
243    select * from PHRASES, BRANDS
244    where trim(PHRASES.PHRASE) contains trim(BRANDS.BRAND);
NOTE: The execution of this query involves performing one or more Cartesian product joins that
      can not be optimized.
NOTE: Table WORK.OUT_SQL_CONTAINS created, with 300 rows and 4 columns.

245  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           40.69 seconds
      user cpu time       40.67 seconds
      system cpu time     0.01 seconds
      memory              1042.18k
      OS Memory           10864.00k
      Timestamp            1/08/2016 11:17:29 PM


246
247  data OUT_DS_HASH;
248    set PHRASES;
249    if _N_=1 then do;
250      dcl hash BRANDS(dataset:'BRANDS');
251      BRANDS.definekey('BRAND');
252      BRANDS.definedata('BRAND','J');
253      BRANDS.definedone();
254      declare hiter IT('BRANDS');
255      if 0 then set BRANDS;
256    end;
257    RC=IT.first();
258    do while (RC = 0);
259      if index(trim(PHRASE), trim(BRAND)) then output;
260      RC=IT.next();
261    end;
262    drop RC;
263  run;

NOTE: There were 100001 observations read from the data set WORK.BRANDS.
NOTE: There were 300 observations read from the data set WORK.PHRASES.
NOTE: The data set WORK.OUT_DS_HASH has 300 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           43.08 seconds
      user cpu time       43.02 seconds
      system cpu time     0.04 seconds
      memory              13795.31k
      OS Memory           23820.00k
      Timestamp            1/08/2016 11:18:12 PM


 

ChrisNZ
Tourmaline | Level 20

Summary of the current status:

- The process is CPU-bound

- Hash table can be made slightly faster than SQL by having a one-byte key (see next post)

- Array match is still the fastest.

 

Recommendations until someone has a better idea, and provided the sample data tested is close enough to the actual data:

- Use arrays rather than SQL

- Split the job into small chunks that can run in parallel so you can use more CPUs

- If your brand lengths vary from, say, 20 to 40, you could make one job for each length and remove the need to trim the brand by making the variable length for each job exactly the needed length, like this:

 

data CLASS4 (keep=NAME_4 rename=(NAME_4=NAME) where=(NAME ne ''))
     CLASS5 (keep=NAME_5 rename=(NAME_5=NAME) where=(NAME ne ''))
     CLASS6 (keep=NAME_6 rename=(NAME_6=NAME) where=(NAME ne ''))
     CLASS7 (keep=NAME_7 rename=(NAME_7=NAME) where=(NAME ne ''));
  length NAME_4 $4   NAME_5 $5   NAME_6 $6   NAME_7 $7;
  array NAME_[4:7] NAME_4 - NAME_7;
  set SASHELP.CLASS;
  NAME_[length(NAME)]=NAME;
run;

 

ChrisNZ
Tourmaline | Level 20

Latest best run times for SQL, array match, and hash match (the times are different because the machine is different, but the ranking is the same):

 

 options fullstimer;

 data PHRASES;
   length PHRASE $2000;
   PHRASE=repeat('x',1000);
   do I=1 to 300; output; end ;
  run;

 data BRANDS;          
   A='a';
   length BRAND $40 ;   
   BRAND=repeat('x',25); output;
   do J=1 to 1e5; BRAND=put(J,20.);output; end ;
 run;
          
 proc sql;          *****  real time 21.18 seconds *****;
   create table OUT_SQL_CONTAINS as
   select * from PHRASES, BRANDS
   where trim(PHRASES.PHRASE) contains trim(BRANDS.BRAND);
 quit;

 data OUT_DS_HASH;  ****  real time 20.30 seconds *****;
   set PHRASES;
   if _N_=1 then do;
     dcl hash BRANDS(dataset:'BRANDS', multidata: "Y");
     BRANDS.definekey('A');   * dummy one-char key;
     BRANDS.definedata('BRAND','J');
     BRANDS.definedone();
     declare hiter IT('BRANDS');
     if 0 then set BRANDS;
   end;
   RC=IT.first();
   do while (RC = 0);
     if index(trim(PHRASE), trim(BRAND)) then output;
     RC=IT.next();
   end;
   drop RC;
 run;
       
data OUT_DS_ARRAY;  ****  real time 19.14 seconds *****;
  array BRANDS [100000] $40 _temporary_;
  if _N_=1 then do K=1 to 100000;
    set BRANDS;
    BRANDS[K]=BRAND;
  end;
  set PHRASES;
  do K=1 to 100000;
    if index(trim(PHRASE), trim(BRANDS[K])) then do;
      BRAND=BRANDS[K];
      output;
    end;
  end;
  drop K;
run;

       

 

ChrisNZ
Tourmaline | Level 20

Another half second saved, but we seem to be scraping the bottom of the barrel.

 

By the way, for implementing parallel runs, MP CONNECT is probably the best method to try first (see the sketch at the end of this post). It's a shame that proc DS2 cannot multi-thread when processing SAS data.

 

 
data _null_;
  set PHRASES;
  call symputx(cat('PHRASE_',_N_),PHRASE);
run;
          
options noquotelenmax;
           
data OUT_DS_MACRO;  ****  real time 18.60 seconds *****;
   set BRANDS;
   length PHRASE $2000;
   %macro loop;
   %local i;
   %do i=1 %to 300;
     if index("&&phrase_&i", trim(BRAND)) then do;
       PHRASE="&&phrase_&i";
       output;
     end;
   %end;
   %mend;
   %loop;
 run;
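 
On the MP CONNECT idea above: a minimal, untested sketch, assuming SAS/CONNECT is licensed and that the brands have already been split into WORK.BRANDS1 and WORK.BRANDS2 (hypothetical names). Each child session scans PHRASES against its half of the brands, and the partial results are stacked at the end.

options sascmd='!sascmd' autosignon;            * spawn child SAS sessions on this machine;

rsubmit task1 wait=no inheritlib=(work=pwork);  * first half of the brands, asynchronous;
  data pwork.OUT1;
    set pwork.PHRASES;
    do POINT=1 to NOBS;
      set pwork.BRANDS1 point=POINT nobs=NOBS;
      if index(trim(PHRASE), trim(BRAND)) then output;
    end;
    drop POINT NOBS;
  run;
endrsubmit;

rsubmit task2 wait=no inheritlib=(work=pwork);  * second half of the brands, asynchronous;
  data pwork.OUT2;
    set pwork.PHRASES;
    do POINT=1 to NOBS;
      set pwork.BRANDS2 point=POINT nobs=NOBS;
      if index(trim(PHRASE), trim(BRAND)) then output;
    end;
    drop POINT NOBS;
  run;
endrsubmit;

waitfor _all_ task1 task2;                      * block until both chunks have finished;
signoff _all_;

data OUT_ALL;                                   * stack the partial results;
  set OUT1 OUT2;
run;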

 

ChrisNZ
Tourmaline | Level 20

I just like performance optimisation.

A process that lasts one month! This pushes all my buttons! 😄

Kurt_Bremser
Super User

For every word in every record in a, SQL will search through b. Ungood.

 

I'd try this:

1. Create a unique ID in dataset base (use _N_ in a data step, for example)
2. Create a new dataset from all the words in base.name_2 and the ID (will be a multiple of the 400k)
3. Sort the intermediate dataset and dict by the trimmed words
4. Join and only keep matches (in a data step)
5. Sort the result by ID with nodupkey
6. Join with base and only keep matches

 

It will be more code to write, but it should perform faster than your SQL by orders of magnitude.
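 
As a rough illustration of those six steps, here is a minimal sketch. It assumes BASE holds the phrases in a variable NAME_2 and DICT holds one single-word brand per row in a variable WORD; the variable names are placeholders, and like the steps above it assumes the dictionary entries are single words.

* Sketch only: BASE, NAME_2 and DICT come from the post above, WORD is a placeholder;

data BASE_ID;                                  * 1. add a unique ID;
  set BASE;
  ID=_N_;
run;

data WORDS (keep=ID WORD);                     * 2. one row per word per phrase;
  set BASE_ID;
  length WORD $40;
  do I=1 to countw(NAME_2, ' ');
    WORD=scan(NAME_2, I, ' ');
    output;
  end;
run;

proc sort data=WORDS; by WORD; run;            * 3. sort both by the word;
proc sort data=DICT;  by WORD; run;

data MATCHES (keep=ID);                        * 4. keep only matching words;
  merge WORDS (in=W) DICT (in=D);
  by WORD;
  if W and D;
run;

proc sort data=MATCHES nodupkey; by ID; run;   * 5. one row per matching phrase;

data WANT;                                     * 6. bring back the full phrases;
  merge BASE_ID MATCHES (in=M);
  by ID;
  if M;
run;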

ChrisNZ
Tourmaline | Level 20

@JohnHoughton showed you were on the right track Kurt. Nice!

LinusH
Tourmaline | Level 20
So my gut feeling wasn't out of bounds...😎
Data never sleeps
JohnHoughton
Quartz | Level 8

Hi.

I wasn't sure if @ChrisNZ's hash object solution had been tested on datasets as big as the original post specified, so here is another option.

 

This solution runs in less than 7 seconds for sample data with 100 thousand individual brands and 300 phrases of around 25 words each.

 

With the dataset sizes mentioned in the original post (25 million individual brands and 400 thousand phrases), it ran in under 20 minutes, using 5-word phrases. I am using SAS OnDemand for Academics, so I'm not sure what slice of the CPU I get.

 

It is similar to @Kurt_Bremser's suggestion, but can handle brand names with multiple words.

 

The sample "BRANDS" data table contains brand names with between 1 & 5 words, all beginning with 'b' (so they can be easily spotted in the phrases). I've included some familiar brands too.

 

The sample "PHRASES" data table is built using a subset of the "BRANDS" table. 1 in 3 phrases contains a brand. Each phrase contains a maximum of one brand. The number of words in each phrase can be adjusted using the "do i=" loop

 

To avoid confusing brand 'b1 b2 b3' with brand 'b1 b2' or brand 'b1', the program searches for brands with the longest word count first (that is, 5 words in the examples). It splits the phrases up into 5-word portions and joins them with the 5-word brand names. It then deletes the found brand names from the phrases.

Then it looks for brands with 4 words and deletes those from the phrases, then 3 words, and so on.

 

First, build the sample data tables:

*create a sample brands data table with brand names of between 1 & 5 words;

data brands (drop=i j);
  length brand $ 60;
  brand="Apple";output;
  brand="Apples";output;
  brand="Applebee";output;
  brand="Coca-Cola";output;
  brand="Head & Shoulders";output;

  do i=0 to 2E4-1 by 1; *number of brands in data table will be 5 times this number;
                        *change to 5E6 to get 25 million individual brand names;
    brand='';

    do j=1 to 5;
      brand=strip(brand)||' b'||strip(put(i+j, 10.));
      brand=strip(brand);
      output;
    end;
  end;
run;

*create the sample phrases data table with rows defined by the obs= option;
*1 in 3 phrases will contain a brand name;
*phrases have 5 words + a brand name;

data phrases (drop=i brand);
  set brands (keep=brand obs=300);
  length phrase $ 2000;
  phrase='';
  do i=1 to 5; *use this to control the number of words in each phrase;
    phrase=catx('', phrase, put(floor((1E8)*Rand("Uniform")), 12.));
    if i=3 and (_N_<=5 or mod(_N_,3)=0) then
      phrase=catx('', phrase, brand);
  end;
  output;
run;

 

Then the program:

 

 

 


*create a column to hold a copy of the original phrase, which will be edited;
data phrases_2;
  set phrases;
  length phrasecopy $ 2000;
  phrasecopy=phrase;
  phraseid=_N_;
run;

*add a brand word count to the brand data set;
data brands_2;
  set brands;
  brandwc=countw(brand, ' ');
run;

*Need to know the maximum word count in a brand name;
proc sql noprint;
  select max(brandwc) into :brandmaxwc from brands_2;
quit;
%put &brandmaxwc;

/*delete any existing 'matches' data table*/
proc datasets lib=work nolist;
  delete matches;
quit;

%macro repeat;
%do phrasepartwords=&brandmaxwc %to 1 %by -1;
*Then split phrases into word portions with &phrasepartwords words;

data phrasesplit (drop=i j wrds phrasewc);
  set phrases_2 (drop=phrase);
  length partphrase $ 200;
  phrasewc=countw(phrasecopy, ' ');
  wrds=&phrasepartwords;
  do i=0 to phrasewc-wrds;
    partphrase='';
    do j=1 to wrds;
      partphrase=catx(' ', partphrase, scan(phrasecopy, i+j, ' '));
    end;
    output;
  end;
run;


proc sql;
  create table temp as
  select p.phrasecopy, p.phraseid, b.brand
  from phrasesplit p inner join brands_2 (where=(brandwc=&phrasepartwords)) b
    on p.partphrase=b.brand
  order by phraseid;
quit;

data temp;
  set temp;
  phrasecopy=transtrn(phrasecopy, strip(brand), '');
run;


proc append base=matches data=temp (drop=phrasecopy);
run;

*Update the phrases_2 data with the matches.
 This removes the found brand name from phrasecopy,
 ready for the next iteration;
data phrases_2;
  merge phrases_2 temp (drop=brand);
  by phraseid;
run;

%end;
%mend;
%repeat;

proc sort data=matches;
  by phraseid;
run;

data want;
  merge phrases_2 (drop=phrasecopy) matches (in=a);
  by phraseid;
  if a;
run;

I am running on SAS OnDemand so benchmarking is awkward.

Try running with the sample data sets as they are and hopefully it will complete in less than 7 seconds as it did for me (running it on SAS OnDemand).

Then change the BRANDS loop to 5E6 (to get 25 million brands) and the PHRASES obs= option to obs=4E5 (to get 400 thousand phrases), and I hope this will complete in less than 10 minutes.

Then you can try increasing the number of words in the phrases, or try with a subset of your real-world data.

 

 

 

If this works for you, then with extra effort it can be extended to look for more than one brand in a phrase. Add another loop within the current loop that repeats until the number of observations in data table 'TEMP' is zero.

Edited Sunday 8th Aug - correction, the code already works for more than one brand in a phrase.

 

Edited Friday 5th Aug

Just ran (using SAS OnDemand) for 25 million brands and 400 thousand phrases of 25 words + brand names. It took about 90 minutes real time! Log file attached.

ChrisNZ
Tourmaline | Level 20

@JohnHoughton Very well done!

The hash table takes 4.5 seconds on my PC while your method takes 0.4 seconds (for loops with I=1 to 1e3).

Increasing the match volume 10,000 times by changing 1e3 to 1e5 in both tables' creation loop only takes 26 seconds.

 

Since the process is no longer CPU bound, further progress is possible. For example, using SPDE takes the time down to 21 seconds.

 

I simplified your match code to just 2 steps:

 

options fullstimer msglevel=i;
%* Create brand sample;
data BRANDS (keep=BRAND);
  length BRAND $ 60;
  BRAND="Apple"                            ;output;
  BRAND="Apples"                           ;output;
  BRAND="I Can’t Believe It’s Not Butter"  ;output;
  BRAND="Coca-Cola"                        ;output;
  BRAND="Head & Shoulders"                 ;output;
  do I=0 to 1e5;                                 %* Brand iterations. This is multiplied by J;
    BRAND='';
    do J=1 to 4;                                 %* Make 1-word to 4-word unique brands;
      BRAND=catx(' ', BRAND, put(ranuni(0),14.12));
      output;
    end;
  end;
run;

%* Create phrase sample;
data PHRASES (keep=PHRASE PHRASE_ID);
  length PHRASE $ 2000;
  do PHRASE_ID=1 to 1e5;                            %* Nb of phrases;
    PHRASE='';
    do J=1 to 20;                                   %* Nb of words per phrase;
      PHRASE=catx(' ', PHRASE, put(ranuni(0)*10,14.12) );
      if ranuni(0)>.95 then do;                     %* Sprinkle one or more brand;
        POINT=int(ranuni(0)*5+1);
        set BRANDS point=POINT;
        PHRASE=catx(' ', PHRASE, BRAND);
      end;
    end;
    output;
  end;
  stop;
run;         

%* Match using a hash table iterator;
data OUT_DS_HASH;
  set BRANDS ;
  if _N_=1 then do;
    dcl hash PHRASES(dataset:'PHRASES');
    PHRASES.definekey('PHRASE_ID');
    PHRASES.definedata('PHRASE','PHRASE_ID');
    PHRASES.definedone();
    declare hiter IT('PHRASES');
    if 0 then set PHRASES;
  end;
  RC=IT.first();
  do while (RC = 0);
    if findw(trim(PHRASE), trim(BRAND), ' ') then output;
    RC=IT.next();
  end;
  drop RC;
run cancel;

%let time=%sysfunc(time());

%* Find the maximum word count in brand names;
proc sql noprint;
  select max(countw(BRAND, ' ')) into :max_brand_word_count from BRANDS;
quit;
%put max_brand_word_count=&max_brand_word_count;

%* Extract consecutive phrase words to build possible brands from the phrases;
libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary;

data SPEEDY.POSSIBLE_BRAND_STRINGS (keep= PHRASE_ID POSSIBLE_BRAND );
 set PHRASES;
 length POSSIBLE_BRAND $200;
 PHRASE_WORD_NB=countw(PHRASE, ' ');
 do WORD_NB_IN_PHRASE=0 to PHRASE_WORD_NB-1;
   POSSIBLE_BRAND='';
   do NB_WORDS_IN_FAKE_BRAND=1 to &max_brand_word_count.;
     POSSIBLE_BRAND=catx(' ', POSSIBLE_BRAND, scan(PHRASE, WORD_NB_IN_PHRASE+NB_WORDS_IN_FAKE_BRAND, ' '));
     if POSSIBLE_BRAND ne lag(POSSIBLE_BRAND) then output;
   end;
 end;
run;   

%* Match possible brands to actual brands and retrieve full phrase;
proc sql _method;
  create table SPEEDY.OUT_SQL_BREAK_DOWN as
  select unique ph.PHRASE_ID, ph.PHRASE, br.BRAND
  from BRANDS                           br
         inner join
       SPEEDY.POSSIBLE_BRAND_STRINGS    pb
         on br.BRAND = pb.POSSIBLE_BRAND
         left join
       PHRASES                          ph
         on pb.PHRASE_ID = ph.PHRASE_ID
  order by ph.PHRASE_ID;
quit;

%put time=%sysevalf(%sysfunc(time())-&time);

 

 

 

 

 

 

 

 

JohnHoughton
Quartz | Level 8

Thanks ChrisNZ.

You made the code simpler, but at the cost of some of the functionality.

I went through a similar stage when I was writing the code, but I didn't like the result because it would not exclude 'Apple' if the phrase included a brand like 'ROBINSONS Fruit Shoot Apple & Blackcurrant'. That's why I set up the repeating macro to search for the longest possible match first and remove it from the string.

 

You did make me realise that my original code already handled multiple brands in a phrase.

 

 

ChrisNZ
Tourmaline | Level 20

@JohnHoughton Good point. A bit of post clean-up is easier, IMHO.

 


data SPEEDY.POSSIBLE_BRAND_STRINGS (keep= PHRASE_ID POSSIBLE_BRAND START END);
 set PHRASES;
 length POSSIBLE_BRAND $200;
 PHRASE_WORD_NB=countw(PHRASE, ' ');
 START=0;
 do WORD_NB_IN_PHRASE=0 to PHRASE_WORD_NB-1;
   POSSIBLE_BRAND='';
   do NB_WORDS_IN_FAKE_BRAND=1 to &max_brand_word_count.;
     POSSIBLE_BRAND=catx(' ', POSSIBLE_BRAND, scan(PHRASE, WORD_NB_IN_PHRASE+NB_WORDS_IN_FAKE_BRAND, ' '));
     END=START+length(POSSIBLE_BRAND);
     if POSSIBLE_BRAND ne lag(POSSIBLE_BRAND) then output;
   end;
   START+1+length(scan(PHRASE, WORD_NB_IN_PHRASE+1,' '));
 end;
run;   

%* Match possible brands to actual brands and retrieve full phrase;
proc sql _method;
  create table SPEEDY.OUT_SQL_BREAK_DOWN as
  select  ph.PHRASE_ID, ph.PHRASE, br.BRAND, pb.START, pb.END
  from BRANDS                           br
         inner join
       SPEEDY.POSSIBLE_BRAND_STRINGS    pb
         on br.BRAND = pb.POSSIBLE_BRAND
         left join
       PHRASES                          ph
         on pb.PHRASE_ID = ph.PHRASE_ID
  order by ph.PHRASE_ID, BRAND, START;
quit;

proc sql; 
  create table SPEEDY.EMBEDDED_BRAND as 
  select unique a.PHRASE_ID, a.BRAND, a.START
  from SPEEDY.OUT_SQL_BREAK_DOWN   a
     , SPEEDY.OUT_SQL_BREAK_DOWN   b
  where a.PHRASE_ID eq b.PHRASE_ID 
    and b.START < a.START   
    and a.END   < b.END          
   order by a.PHRASE_ID, BRAND, START;
quit;     

data SPEEDY.OUT_SQL_BREAK_DOWN2;
  merge SPEEDY.OUT_SQL_BREAK_DOWN 
        SPEEDY.EMBEDDED_BRAND     (in=EMBEDDED_BRAND);
  by PHRASE_ID BRAND START;
  if not EMBEDDED_BRAND;
run;

 

 

JohnHoughton
Quartz | Level 8

Hi Chris
I like the post clean-up idea, but I couldn't work out the logic for it, so I did the repeat macro.


If you try the code with my original example BRANDS data, it leaves some unwanted brand names.
For example, my brands data table includes the following made-up brands:
'b1'
'b1 b2'
'b1 b2 b3'
'b1 b2 b3 b4'
'b2'
'b2 b3'
'b2 b3 b4'
'b3'
'b3 b4'
So say a phrase includes brand 'b1 b2 b3 b4' and the phrase goes
'blah blah blah b1 b2 b3 b4 blah blah blah'
Then the output in SPEEDY.OUT_SQL_BREAK_DOWN goes:

BRAND           START  END  b.START < a.START and a.END < b.END (so output into EMBEDDED_BRAND)?
'b1'            15     17
'b1 b2'         15     20
'b1 b2 b3'      15     23
'b1 b2 b3 b4'   15     26
'b2'            18     20   yes
'b2 b3'         18     23   yes
'b2 b3 b4'      18     26
'b3'            21     23   yes
'b3 b4'         21     26
'b4'            24     26


So the output in SPEEDY.OUT_SQL_BREAK_DOWN2 is:

BRAND           START  END
'b1'            15     17
'b1 b2'         15     20
'b1 b2 b3'      15     23
'b1 b2 b3 b4'   15     26
'b2 b3 b4'      18     26
'b3 b4'         21     26
'b4'            24     26

instead of just 'b1 b2 b3 b4', which I get with my repeat macro logic.

 

Maybe it isn't realistic to expect real-life brand names that mirror my made-up brand names, and in the real world your logic may work 99.9% of the time, but with over 25 million brand names in the real data, who knows.

 

I do think the post clean-up solution needs a tweak to match my repeat macro logic on data like my made-up brands data. But I am confident that you will be able to find it.
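 
For what it's worth, one possible tweak (an untested sketch, not a definitive fix): in the EMBEDDED_BRAND query, allow the containing match to share a start or end position and only require it to be strictly longer, so 'b2 b3 b4' is dropped when 'b1 b2 b3 b4' covers the same span.

proc sql;
  create table SPEEDY.EMBEDDED_BRAND as
  select unique a.PHRASE_ID, a.BRAND, a.START
  from SPEEDY.OUT_SQL_BREAK_DOWN   a
     , SPEEDY.OUT_SQL_BREAK_DOWN   b
  where a.PHRASE_ID eq b.PHRASE_ID
    and b.START <= a.START                       /* a starts no earlier than b ... */
    and a.END   <= b.END                         /* ... and ends no later ...      */
    and (b.END - b.START) > (a.END - a.START)    /* ... and b is strictly longer   */
  order by a.PHRASE_ID, BRAND, START;
quit;

On the 'b1 b2 b3 b4' example this keeps only the longest match; genuinely overlapping (non-nested) brands would still need extra thought.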
