DATA Step, Macro, Functions and more

SAS SQL self join on large datasets

Occasional Contributor
Posts: 5

SAS SQL self join on large datasets

Hello,

 

Wondering what others have done in a similar situation - I have a long dataset with account numbers and monthly performance data, which I need to convert into a wide dataset (in essence appending subsequent months' performance data to the previous month's record).

 

I do this by first putting together the monthly files into 1 large dataset (the Month0 dataset below) and doing a left join on this dataset. The Month0 dataset has millions of rows.

 

Are there any SAS options to improve the efficiency of this query, or are there modifications to this query anyone can suggest? Thanks!

 

%macro BuildLookouts(begdate, enddate);
 %let start=%sysfunc(inputn(&begdate,anydtdte9.));
 %let end=%sysfunc(inputn(&enddate,anydtdte9.));

 /*determines the number of months between the two dates*/
 %let dif=%sysfunc(intck(month,&start,&end));
 %put dif=&dif.;

 %do i=1 %to &dif;

  proc sql;
   create table Lookout_Month&i as

   select a.*
    ,b.perf_month_month&i
    ,b.perf_MTD_month&i
   from Lookout_Month%eval(&i-1) a
   left join Lookout_Month0(rename=(

    perf_month = perf_month_month&i
    perf_MTD = perf_MTD_month&i
    )) b
    on a.acct_number = b.acct_number
    and intnx('month',a.perf_month,&i,'e') = b.perf_month_month&i
   ;
  quit;

 %end;
%mend;

%BuildLookouts(31JAN2016, 31DEC2017);

Valued Guide
Posts: 561

Re: SAS SQL self join on large datasets

Have you tried Proc Transpose? I would have thought that would be more efficient than an SQL solution here.
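A minimal sketch of that approach, assuming the long table is named month0 with columns acct_number, perf_month, and perf_MTD as in the original query (those names come from the post; the rest is hypothetical):

```sas
/* Sketch only: one transpose per measure, then merge the wide pieces.
   Assumes month0 is the long table from the original post. */
proc sort data=month0;
  by acct_number perf_month;
run;

proc transpose data=month0 out=wide_perf(drop=_name_) prefix=perf_month_month;
  by acct_number;
  var perf_month;
run;

proc transpose data=month0 out=wide_mtd(drop=_name_) prefix=perf_MTD_month;
  by acct_number;
  var perf_MTD;
run;

data want;
  merge wide_perf wide_mtd;
  by acct_number;
run;
```

Two transposes are used because PROC TRANSPOSE widens one VAR at a time per BY group; each pass is sequential, which tends to scale better than repeated self-joins.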

Contributor
Posts: 34

Re: SAS SQL self join on large datasets

In a project I am coding, we ran into a problem where we have billions of rows of data that we need to make into a wide table, do some processing on, and then write back out to the database. The size of the problem meant we would literally be processing for days just in the first step of reconstituting the file, so we decided to go a different way:

 

1. Sort the data that you need to process.

2. Make in-memory arrays to hold the data. For you it sounds like you need something by month; we had a much wider need, 386 elements wide.

3. Process the dataset by the unique identifier you sorted by.

4. Output only the data you need for your answer.

 

We came up with some pretty ingenious ways of looking at the problem, and it works in a data step. Instead of days and days of processing, our complete answer is done in 8 hours. We massage 6.7 billion rows of data in memory to achieve our answer, processing 12.7 million rows per minute.
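The four steps above can be sketched in a single DOW-loop data step (all names here are hypothetical; the poster's actual code appears further down the thread):

```sas
/* Hypothetical sketch of the four steps: sort, array, group-wise pass, output */
proc sort data=big;            /* 1: sort by the processing key */
  by id month_no;
run;

data answer;
  array vals[12] v1-v12;       /* 2: in-memory array, one slot per month */
  do until(last.id);           /* 3: process one id group per data step pass */
    set big;
    by id;
    vals[month_no] = perf;     /* land each row in its month's slot */
  end;
  output;                      /* 4: output only the finished wide row */
  keep id v1-v12;
run;
```

Because the array is a set of non-retained data step variables, it resets to missing automatically at the top of each pass, i.e. once per id group.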

Occasional Contributor
Posts: 5

Re: SAS SQL self join on large datasets

Thanks all for your responses.

 

Morgair - I am particularly intrigued by your solution. Are you able to share your code with me? Thanks!

Esteemed Advisor
Posts: 5,409

Re: SAS SQL self join on large datasets

I would stay away from SQL, macro programming, and PROC TRANSPOSE for this task. I would do something like:

 

1) proc sort on account number, with a where clause to extract the required months to dataset extractedData

2) data _null_ to build the lists of column names in macro variables &perf and &mtd, of the form perf_month_month1-perf_month_month20 and perf_mtd_month1-perf_mtd_month20, say

3)
data want;
  array perf &perf.;
  array mtd &mtd.;
  do until(last.accountNumber);
    set extractedData;
    by accountNumber;
    * get array index from date, assign values;
  end;
run;
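Filling in steps 2 and 3 as a sketch (assuming, as in the original macro, that &start holds the first month and &dif the number of months; since the generated names are numbered, the SAS variable-list shorthand can stand in for the data _null_ step):

```sas
/* Step 2: numbered-variable lists for the arrays (hypothetical count &dif) */
%let perf = perf_month_month1-perf_month_month&dif.;
%let mtd  = perf_mtd_month1-perf_mtd_month&dif.;

/* Step 3: one sequential pass, landing each month in its array slot */
data want;
  array perf[&dif.] &perf.;
  array mtd[&dif.] &mtd.;
  do until(last.accountNumber);
    set extractedData;
    by accountNumber;
    i = intck('month', &start., perf_month) + 1; /* array index from the date */
    perf[i] = perf_month;
    mtd[i]  = perf_MTD;
  end;
  drop i perf_month perf_MTD;
run;
```

One sorted pass over the extract replaces all of the self-joins, touching each row exactly once.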

PG
Super User
Posts: 9,612

Re: SAS SQL self join on large datasets

You say

 

I do this by first putting together the monthly files into 1 large dataset (the Month0 dataset below) and doing a left join on this dataset. The Month0 dataset has millions of rows.

 

Since you have monthly datasets already, why not do this:

data _null_;
  dif = intck('month', input("&begdate", anydtdte9.), input("&enddate", anydtdte9.));
  call execute('data month_all; merge ');
  do i = 1 to dif;
    call execute('month' !! strip(put(i, best.)) !!
      '(rename=(perf_month=perf_month' !! strip(put(i, best.)) !!
      ' perf_MTD=perf_MTD' !! strip(put(i, best.)) !! ')) ');
  end;
  call execute('; by acct_number; run;');
run;

without building the monster dataset.
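For illustration, with three monthly files (hypothetical names, and assuming perf_MTD is renamed per month the same way so the two measures do not collide in the merge), the CALL EXECUTE calls generate a step like:

```sas
data month_all;
  merge
    month1(rename=(perf_month=perf_month1 perf_MTD=perf_MTD1))
    month2(rename=(perf_month=perf_month2 perf_MTD=perf_MTD2))
    month3(rename=(perf_month=perf_month3 perf_MTD=perf_MTD3))
  ;
  by acct_number;
run;
```

The generated code runs after the data _null_ step finishes, so the whole wide table is built in a single merge without ever stacking the monthly files first.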

Contributor
Posts: 34

Re: SAS SQL self join on large datasets

Before I could release the code, I had to sterilize it. The names should be consistent representations of the original variables, and each is surrounded by "<>" to denote that it is a sterilized name.

 

The code is pretty simple, but it does our processing very fast:

 

proc sql;
  create table <localFlagSource> (replace=yes) as
    select <elementType>, <flagName> from <serverDataSource1> order by <elementType>;
  create table <localFormulaSource> (replace=yes) as
    select <formulaType>, <SASFormula> from <serverDataSource2> order by <formulaType>;
quit;
%let <elementTypeMacroVar>=;
%let <flagListMacroVar>=;
%let <formulaSourceMacroVar>=;
%let <formulaMacroVar>=;
proc sql noprint;
  select <elementType>, <flagName> into :<elementTypeMacroVar> separated by ' ', :<flagListMacroVar> separated by ' ' from <localFlagSource> order by <elementType>;
  /* separated by nothing is correct for loading <formulaMacroVar>: it packs the formulas together; each is terminated by a semicolon */
  select <formulaType>, <SASFormula> into :<formulaSourceMacroVar> separated by ' ', :<formulaMacroVar> separated by '' from <localFormulaSource> order by <formulaType>;
quit;
%let <formulaMacroVar>=%SUPERQ(<formulaMacroVar>);
%putTime;
data localSAS.test (replace=yes);
  <calculatedField>=0;
  array flag[*] &<flagListMacroVar>.;
  retain &<flagListMacroVar>.;
  set localSAS.<localSource> /*(obs=100000000)*/ end=eof;
  by <uniqueKey>;
  if first.<uniqueKey> then do; /* blank our in-memory array for each new grouping */
    do i=1 to dim(flag);
      flag[i]=0;
    end;
  end;
  flag[findw("&<elementTypeMacroVar>.", strip(put(<elementType>, best.)), ' ', 'E')]=1; /* reconstitute our data array in memory only */
  if last.<uniqueKey> then do;
    &<formulaMacroVar>.; /* entire list of formulas to process; each true clause contains an output */
  end;
  keep <uniqueKey> <calculatedField>;
run;

 

The few lines of the data step are where we do the processing, which may be a little misleading: there are approximately 400 formulas that load into <formulaMacroVar> and are expanded into the code for in-memory processing on each pass. The data array is reconstituted in memory using the macro variable that contains the list of variable names; elements are set appropriately in the lookup so that the formulas can work (each formula looks for a 1 in the elements it uses for calculation). The list is cleared at the beginning of each data group in the source file of the data step.

 

 

The processing is done locally on my workstation:

Processor Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz, 2394 Mhz, 10 Core(s), 20 Logical Processor(s)

and processes through our source in the datastep at the rate of about 12.7 Million observations per minute.
