PS185
Calcite | Level 5

Hello,

 

Wondering what others have done in a similar situation - I have a long dataset with account numbers and monthly performance data, which I need to convert into a wide dataset (in essence, appending each subsequent month's performance data to the previous month's record).

 

I do this by first putting together the monthly files into 1 large dataset (the Month0 dataset below) and doing a left join on this dataset. The Month0 dataset has millions of rows.

 

Are there any SAS options to improve the efficiency of this query, or are there modifications to this query anyone can suggest? Thanks!

 

%macro BuildLookouts(begdate, enddate);
 %let start=%sysfunc(inputn(&begdate,anydtdte9.));
 %let end=%sysfunc(inputn(&enddate,anydtdte9.));

 /*determines the number of months between the two dates*/
 %let dif=%sysfunc(intck(month,&start,&end));
 %put dif=&dif.;

 %do i=1 %to &dif;

  proc sql;
   create table Lookout_Month&i as

   select a.*
    ,b.perf_month_month&i
    ,b.perf_MTD_month&i
   from Lookout_Month%eval(&i-1) a
   left join Lookout_Month0(rename=(

    perf_month = perf_month_month&i
    perf_MTD = perf_MTD_month&i
    )) b
    on a.acct_number = b.acct_number
    and intnx('month',a.perf_month,&i,'e') = b.perf_month_month&i
   ;
  quit;

 %end;
%mend BuildLookouts;

%BuildLookouts(31JAN2016, 31DEC2017);

6 REPLIES
ChrisBrooks
Ammonite | Level 13

Have you tried Proc Transpose? I would have thought that would be more efficient than an SQL solution here.
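
A hedged sketch of that suggestion (not code from the original thread), assuming the long dataset Month0 carries acct_number, perf_month (a SAS date), and perf_MTD as in the original post; a second transpose, merged back in, would be needed for additional measures:

```sas
/* Long-to-wide with PROC TRANSPOSE -- illustrative sketch only */
proc sort data=month0 out=month0_srt;
  by acct_number perf_month;
run;

proc transpose data=month0_srt out=wide(drop=_name_) prefix=perf_MTD_;
  by acct_number;
  id perf_month;             /* one column per month */
  var perf_MTD;
  format perf_month monyy7.; /* yields names like perf_MTD_JAN2016 */
run;
```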

morgalr
Obsidian | Level 7

In a project I am coding, we ran into a problem where we have billions of rows of data that we need to make into a wide table, do some processing on, and then write back out to the db.  The size of the problem meant we would literally be processing for days just in the first step of reconstituting the file, so we decided to go a different way:

 

1. Sort the data that we need to process.

2. Make in-memory arrays to hold the data; for you it sounds like you need something by months, we had a much wider need, 386 elements wide.

3. Process the dataset by the unique identifier that you sorted by.

4. Output only the data you need for your answer.

 

We came up with some pretty ingenious ways of looking at the problem, and it works in a data step.  Instead of days and days of processing, our complete answer is done in 8 hours.  We massage 6.7 billion rows of data in memory to achieve our answer, processing 12.7 million rows per minute.
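
The four steps above can be sketched in a data step along these lines (an illustrative sketch, not morgalr's production code; the 24-month window and the '31JAN2016'd baseline are assumptions):

```sas
/* Step 1: sort by the unique identifier */
proc sort data=month0 out=sorted;
  by acct_number perf_month;
run;

/* Steps 2-4: in-memory array, DOW loop over each account, one wide row out */
data wide(keep=acct_number perf_MTD_1-perf_MTD_24);
  array mtd[24] perf_MTD_1-perf_MTD_24; /* reset to missing each iteration */
  do until (last.acct_number);
    set sorted;
    by acct_number;
    i = intck('month', '31JAN2016'd, perf_month) + 1; /* month -> array slot */
    if 1 <= i <= 24 then mtd[i] = perf_MTD;
  end;
  output; /* one row per acct_number */
run;
```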

PS185
Calcite | Level 5

Thanks all for your responses.

 

morgalr - I am particularly intrigued by your solution. Are you able to share your code with me? Thanks!

PGStats
Opal | Level 21

I would stay away from SQL, macro programming, and proc transpose for this task. I would do something like:

 

1) proc sort on account number, with a where clause to extract the required months, into dataset extractedData

2) data _null_ for building the lists of column names in macro variables &perf and &mtd, of the form perf_month_month1-perf_month_month20 and perf_mtd_month1-perf_mtd_month20, say

3)
data want;
array perf &perf.;
array mtd &mtd.;
do until(last.accountNumber);
    set extractedData; by accountNumber;
    * get array index from date, assign values;
end;
run;
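
One possible shape for the data _null_ in step 2 (a sketch under the assumption that extractedData holds perf_month as a SAS date; the post above leaves the details open):

```sas
/* Derive the month span from the data and emit the name lists */
data _null_;
  retain minm maxm;
  set extractedData end=eof;
  if missing(minm) or perf_month < minm then minm = perf_month;
  if perf_month > maxm then maxm = perf_month;
  if eof then do;
    n = intck('month', minm, maxm) + 1;
    call symputx('perf', cats('perf_month_month1-perf_month_month', n));
    call symputx('mtd',  cats('perf_mtd_month1-perf_mtd_month', n));
  end;
run;
```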

PG
Kurt_Bremser
Super User

You say

 

I do this by first putting together the monthly files into 1 large dataset (the Month0 dataset below) and doing a left join on this dataset. The Month0 dataset has millions of rows.

 

Since you have monthly datasets already, why not do this:

data _null_;
  dif = intck('month', input("&begdate", anydtdte9.), input("&enddate", anydtdte9.));
  call execute('data month_all; merge ');
  do i = 1 to dif;
    call execute('month' !! strip(put(i, best.)) !! '(rename=(perf_month=perf_month' !! strip(put(i, best.)) !! ')) ');
  end;
  call execute('; by acct_number; run;');
run;

without building the monster dataset.
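
For illustration (not part of the original reply): with a two-month range, the call execute statements above would emit a step along these lines. Note that each monthly dataset must already be sorted by acct_number, and perf_MTD would need the same rename treatment so it is not overwritten in the merge:

```sas
data month_all;
  merge
    month1(rename=(perf_month=perf_month1))
    month2(rename=(perf_month=perf_month2))
  ;
  by acct_number;
run;
```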

morgalr
Obsidian | Level 7

Before I could release the code I had to sterilize it; the names are consistent representations of the original variables, and each is surrounded by "<>" to denote a sterilized name.

 

The code is pretty simple, but it does our processing very fast:

 

proc sql;
  create table <localFlagSource> (replace=yes) as 
    select <elementType>, <flagName> from <serverDataSource1> order by <elementType>;
  Create table <localFormulaSource> (replace=yes) as 
    select <formulaType>, <SASFormula> from <serverDataSource2> order by <formulaType>;
quit;
%let <elementTypeMacroVar>=;
%let <flagListMacroVar>=;
%let <formulaSourceMacroVar>=;
%let <formulaMacroVar>=;
proc sql NOPRINT;
  select <elementType>, <flagName> into :<elementTypeMacroVar> separated by ' ', :<flagListMacroVar> separated by ' ' from <localFlagSource> order by <elementType>;
  /* separated by nothing is correct for loading <formulaMacroVar>, packing the formulas together--each is terminated by a semicolon */
  select <formulaType>, <SASFormula> into :<formulaSourceMacroVar> separated by ' ', :<formulaMacroVar> separated by '' from <localFormulaSource> order by <formulaType>;
quit;
%let <formulaMacroVar>=%SUPERQ(<formulaMacroVar>);
%putTime;
data localSAS.test (replace=yes);
  <calculatedField>=0;
  array flag[*] &<flagListMacroVar>.;
  retain &<flagListMacroVar>.;
  set localSAS.<localSource> /*(obs=100000000)*/ end=eof;
  by <uniqueKey>;
  if first.<uniqueKey> then do; /* blank our in-memory array for each new grouping */
    do i=1 to dim(flag);
      flag[i]=0;
    end;
  end;
  flag[findw("&<elementTypeMacroVar>.", strip(put(<elementType>, best.)), ' ', 'E')]=1; /* reconstitute our data array in memory only */
  if last.<uniqueKey> then do;
    &<formulaMacroVar>.; /* entire list of formulas to process; each true clause contains an output */
  end;
  keep <uniqueKey> <calculatedField>;
run;

 

The few lines of the data step are where we do the processing, which may be a little misleading: there are approximately 400 formulas that load into formulaMacroVar and are expanded into the code for in-memory processing on each pass.  Our data array is reconstituted in memory using the macro variable that contains the list of variable names; elements are set appropriately in the lookup so the formulas can work--each formula looks for a 1 in the elements it uses for calculation.  The list is cleared at the beginning of each data group in the source file of the data step.

 

 

The processing is done locally on my workstation:

Processor Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz, 2394 MHz, 10 Core(s), 20 Logical Processor(s)

and it processes through our source in the data step at a rate of about 12.7 million observations per minute.

