Art,
I like the general idea, but my gut says you would get faster results by switching from MERGE to SET. You would have to go through the usual complications ... what to do if first.id, what to retain, what to do if last.id. I'm up to my eyeballs for the next few weeks and won't be able to test (sorry).
Good luck.
: Your gut was telling you the truth. It reduced the run on the 50 million test case from 7 minutes to 2 minutes.
: I haven't had time to turn this into a general solution yet, but here is the code (based on Asounding's sugestion), that accomplished the test run in 2 minutes. It will still require sorted data, but doesn't require the upfront sql code as I hard coded the variables, and should be able to accomodate your full length records.
data want;
array want(48)ind1_2011JAN ind2_2011JAN ind3_2011JAN ind4_2011JAN
ind1_2011FEB ind2_2011FEB ind3_2011FEB ind4_2011FEB
ind1_2011MAR ind2_2011MAR ind3_2011MAR ind4_2011MAR
ind1_2011APR ind2_2011APR ind3_2011APR ind4_2011APR
ind1_2011MAY ind2_2011MAY ind3_2011MAY ind4_2011MAY
ind1_2011JUN ind2_2011JUN ind3_2011JUN ind4_2011JUN
ind1_2011JUL ind2_2011JUL ind3_2011JUL ind4_2011JUL
ind1_2011AUG ind2_2011AUG ind3_2011AUG ind4_2011AUG
ind1_2011SEP ind2_2011SEP ind3_2011SEP ind4_2011SEP
ind1_2011OCT ind2_2011OCT ind3_2011OCT ind4_2011OCT
ind1_2011NOV ind2_2011NOV ind3_2011NOV ind4_2011NOV
ind1_2011DEC ind2_2011DEC ind3_2011DEC ind4_2011DEC;
retain want;
set have;
by id;
if first.id then call missing(of want(*));
want((month(date)-1)*4+1)=ind1;
want((month(date)-1)*4+2)=ind2;
want((month(date)-1)*4+3)=ind3;
want((month(date)-1)*4+4)=ind4;
drop date ind1-ind4;
if last.id then output;
run;
You might consider this technique to create the array.
Then use data FRAME in your data step with an unexecuted SET. Saves typing the array statement.
DN: Most definitely! I was surprised to discover that the upfront proc summary call ran in around 30% of the time it took the initial proc sql call to run. While this was an SGF quality paper already with just Ksharp's idea, combined with your, Asounding's and my input, it will definitely be a nice addition to the SAS-related literature.
I wonder if there is much or any performance advance by factoring out the MONTH function
if first.id then call missing(of want(*));
m = (month(date)-1)*4;
want(m+1)=ind1;
want(m+2)=ind2;
want(m+3)=ind3;
want(m+4)=ind4;
drop date m ind1-ind4;
if last.id then output;
DN: Factoring out the month function cut the time in half! However, rather than use your suggestion of proc summary and transpose, I did the same thing using proc sql to create a macro variable. The only difference is with respect to the ordering of the output variables.
: The current version of the code, as shown below, took 75 seconds cpu time and just under 5 minutes real time for 50 million records. I speeded up the proc sql part by limiting it to only analyzing the first 1,000 records. That number would have to be increased if all 12 months don't exist somewhere in the first 1,000 records:
/* Create some test data */
/* Note: Every other record in the dataset is missing values for one month */
data have (drop=months);
format id best32.;
input date ind1-ind4 ;
informat date date9.;
format date yymon7.;
other_variable=2;
do id=1 to 5000000;
date="01dec2010"d;
do months=1 to 12;
date=intnx('month',date,1);
if not(months eq 9 and mod(id,2)) then output;
end;
end;
cards;
01dec2010 1 2 3 4
;
/* Create a macro variable that specifies the desired variable names after transposing */
/* Note: the &junk variable is only created so that proc sql doesn't return a warning */
proc sql noprint;
select distinct 'ind1_'||put(date,yymon7.)||
' ind2_'||put(date,yymon7.)||
' ind3_'||put(date,yymon7.)||
' ind4_'||put(date,yymon7.),
date
into :vars separated by " ",
:junk
from have (obs=1000) /*must be big enough to capture all months*/
order by date
;
quit;
data want;
set have;
by id;
array want(*) &vars.;
retain want;
if first.id then call missing(of want(*));
_n_=month(date);
want((_n_-1)*4+1)=ind1;
want((_n_-1)*4+2)=ind2;
want((_n_-1)*4+3)=ind3;
want((_n_-1)*4+4)=ind4;
drop date ind1-ind4;
if last.id then output;
run;
for large data volumes, some performance might be gained using DOW loop rather than testing if first.id and setting so many to missing
something like
data want;
do until( last.id) ;
set have;
by id;
array want(*) &vars.;
_n_=month(date);
want((_n_-1)*4+1)=ind1;
want((_n_-1)*4+2)=ind2;
want((_n_-1)*4+3)=ind3;
want((_n_-1)*4+4)=ind4;
drop date ind1-ind4;
end;
run;
Peter,
Quite possibly. I'll try to get around to testing it later tonight.
Art
Art
Please show your code. (If you haven't accidentally added a RETAIN to the code I suggested)
Peter.
: Yes, I had left a retain statement in the code. However, surprisingly (to me at least), your version doesn't appear to run any faster. However, there was a lot of variance between subsequent runs.
Overall, I think your version runs slightly faster, but not consistently. Here is what I compared:
data have;
array var(*) var1-var1000;
do idnum=1 to 100000;
date="01dec2010"d;
do months=3 to 12 by 3;
date=intnx('month',date,3);
do i=1 to 10;
var(i) = ceil( 9*ranuni(123) );
end;
if months eq 6 and mod(idnum,2) then
call missing(var(1));
output;
end;
end;
run;
data work.want;
set have (keep=idnum date var1);
by idnum notsorted;
array want_num(*)
var1_Qtr1
var1_Qtr2
var1_Qtr3
var1_Qtr4
;
array have_num(*) var1;
retain want_num;
if first.idnum then call missing(of want_num(*));
___nnum=put(date,labelfmt.)*dim(have_num);
do ___i=1 to dim(have_num);
want_num(___nnum+___i)=have_num(___i);
end;
drop date ___: var1;
if last.idnum then output;
run;
data work.want2;
do until (last.idnum);
set have (keep=idnum date var1);
by idnum;
array want_num(*)
var1_Qtr1
var1_Qtr2
var1_Qtr3
var1_Qtr4
;
array have_num(*) var1;
___nnum=put(date,labelfmt.)*dim(have_num);
do ___i=1 to dim(have_num);
want_num(___nnum+___i)=have_num(___i);
end;
drop date ___: var1;
if last.idnum then output;
end;
run;
Peter Crawford wrote:
for large data volumes, some performance might be gained using DOW loop rather than testing if first.id and setting so many to missingsomething like
data want;
do until( last.id) ;
set have;
by id;
array want(*) &vars.;
_n_=month(date);
want((_n_-1)*4+1)=ind1;
want((_n_-1)*4+2)=ind2;
want((_n_-1)*4+3)=ind3;
want((_n_-1)*4+4)=ind4;
drop date ind1-ind4;
end;
run;
Seems like there might be some small additional performance gain to moving -1)*4 to the assignment of month. _n_=(month(date)-1)*4.
That will definitely make the code run faster.
Art....there are a lot of changes happening.Havent had a chance to test it,
Art,
There are a lot of changes happening to the existing business logic.I will give you an update once everythjng is finalized,..like how many variabes to include and what variables need processing etc and will run the data step method and let you know about the cpu time.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.