Re: How to improve Efficiency

monona · Posted 04-17-2019 08:32 PM

I am left-joining two tables. table A has 11M rows with 600mb and the table B is 430mb. If i just run this code only with the first 1000 rows of table A, it produced the output as desired.(takes only few seconds) However, If I run this whole thing, it doesn't end.

proc sql;
    create table want as
    select a.*, b.rsid
    from A
    left join B
    on a.chr=b.chr and b.start <= a.bp <= b.end
    order by id;
quit;

Since my computer has 64gb Ram, I modified SAS memory size from 2gb to 6gb on cfg file. What can I do to improve the speed.

I modified

ChrisNZ · Posted 04-17-2019 10:57 PM

1. Keeping both your tables sorted by variable CHR would hugely improve speed

2. Since you have 64 GB of RAM, you can increase your memory settings to use that, if you are the only user.

For example options MEMSIZE=60G REALMEMSIZE=60G.

Since you also sort, increasing option SORTSIZE may also be useful, though the documentation only links it to PROC SORT.

3. mb would translate to milli bit, MegaByte usually written as MB

High-Performance SAS Coding - Third Edition

ScottBass · Posted 04-18-2019 12:20 AM

@ChrisNZ is a performance guru (buy his book 😉 ) so what he said.

However, here are a few more generic options. You may have to hit the docs on some of this...

1) Use proc sql _method option. Google if you don't know what this is. If you see that proc sql is sorting your datasets, stop. I have this personal saying "Sorting is evil". Well, not exactly, but it is usually one of the first places I look when I'm having performance issues, esp. as data volumes increase (including when the data is "wide").

2) As Chris said, sort your data, then see https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/127-29.pdf, search on "sortedby". Asserting that your data is already sorted may convince the proc sql query optimizer not to sort your data.

3) Create an index on B.chr, then use an index key lookup to do the join. See http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000173782.htm (Example 7 & 8). Since your key will not be unique, you'll need to loop, checking _IORC_, with a subsetting IF statement to only keep the record where start <= bp <= end.

4) Similar approach but use a HASH object. See https://github.com/scottbass/SAS/blob/master/Macro/hash_define.sas. Search the header for multidata. Hopefully you can understand what the multidata code is doing. Note the header use cases are self-contained; you could run them, view the generated code, and modify to suit your needs.

5) These are likely permanent datasets given their size. However, if you have some control over these datasets (or they are work datasets), consider using the SPDE engine instead of the BASE engine.

If it were me I'd investigate #3 and #4. The code may be less "syntactically elegant" than proc sql but you have more control and will likely get better performance, esp. if proc sql is sorting your two tables to do the join.

Hope this helps...

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

ScottBass · Posted 04-18-2019 12:40 AM

P.S.: Here's a simple index key lookup step based on the code in #4 hash_define header:

* Multidata lookup - better source/lookup example datasets ;
data lookup;
  set
    sashelp.class
    sashelp.class
    sashelp.class
  ;
run;
proc sort;
  by name;
run;
* build dummy date ranges ;
data lookup;
  set lookup;
  date=(_n_-1)*3;
  format date date7.;
run;
* create index ;
proc datasets lib=work nolist;
   modify lookup;
   index create name;
quit;

data joined;
   set sashelp.class (keep=name);
   * capture value of _ERROR_ before the lookup ;
   error=_ERROR_;
   set lookup key=name;
   do while (_IORC_=0);
      if "01JAN60"d le date le "01FEB60"d then do;
         output;
         leave;  * assumes no overlapping date ranges ;
      end;
      set lookup key=name;  * next lookup ;
   end;
   * SAS annoyingly treats a missed lookup as an error, so reset _ERROR_ ;
   _ERROR_=error;
   drop error;
run;

It would be a good idea to walk through this code using the data step debugger to trace its logic. Beware that you can create an infinite loop if you're not careful (as I did before using the debugger - I'd forgotten the next lookup step inside the loop!)

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

ChrisNZ · Posted 04-18-2019 01:10 AM

Thanks @ScottBass ! 🙂

Another trick (it's in the book!) to hugely speed up SQL joins is to use equi-joins.

data T1(keep=START END)
     T2(keep=BP);
  do START= 1 to 2e4;
    BP  = START+int((ranuni(1)-.5)*400);
    END = START+int(ranuni(2)*300);
    output; 
  end;
run;
     
proc sql _method;
  create table OUT1 as 
  select START, END, BP
  from T1, T2 
  where START <= BP <= END;
quit;

data DIF;
  do DAYS= 0 to 500;
    output; 
  end; 
run;

proc sql _method;
  create table OUT2 as 
  select START, END, BP
  from T1, T2, DIF D1, DIF D2 
  where BP  = D1.DAYS + START
    and END = D2.DAYS + BP;
quit;

NOTE: Table WORK.OUT1 created, with 2985443 rows and 3 columns.

real time 13.49 seconds
user cpu time 13.39 seconds

NOTE: Table WORK.OUT2 created, with 2985443 rows and 3 columns.

real time 3.81 seconds
user cpu time 3.71 seconds

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-22-2019 06:44 PM

If you can, using a hash table avoids a Cartesian product and slashes the run time.

For example:

proc sql _method;
  create table OUT1 as 
  select unique START, END, BP
  from T1, T2 
  where START <= BP <= END ;
quit;

data DIF;
  do DAYS= 0 to 500;
    output; 
  end; 
run;

proc sql _method;
  create table OUT2 as 
  select unique START, END, BP
  from T1, T2, DIF D1, DIF D2 
  where BP  = D1.DAYS + START
    and END = D2.DAYS + BP ;
quit;

data OUT3;  
  set T1;
  if _N_=1 then do;
    dcl hash H(dataset:'T2');
    H.definekey('BP');
    H.definedata('BP');
    H.definedone();
    BP=.;
  end;
  do I= START to END;
    RC=H.find(key:I);
    if RC=0 then do;
      output;
    end;
  end;
run;

NOTE: Table WORK.OUT1 created, with 1895918 rows and 3 columns.

real time 17.97 seconds
user cpu time 15.31 seconds

NOTE: Table WORK.OUT2 created, with 1895918 rows and 3 columns.

real time 5.00 seconds
user cpu time 5.37 seconds
system cpu time 0.39 seconds

NOTE: The data set WORK.OUT3 has 1895918 observations and 5 variables.
real time 0.64 seconds
user cpu time 0.57 seconds

High-Performance SAS Coding - Third Edition

How to improve Efficiency