Transpose - Page 2

Astounding · Posted 01-15-2013 02:59 PM

Art,

I like the general idea, but my gut says you would get faster results by switching from MERGE to SET. You would have to go through the usual complications ... what to do if first.id, what to retain, what to do if last.id. I'm up to my eyeballs for the next few weeks and won't be able to test (sorry).

Good luck.

art297 · Posted 01-15-2013 06:04 PM

: Your gut was telling you the truth. It reduced the run on the 50 million test case from 7 minutes to 2 minutes.

: I haven't had time to turn this into a general solution yet, but here is the code (based on Asounding's sugestion), that accomplished the test run in 2 minutes. It will still require sorted data, but doesn't require the upfront sql code as I hard coded the variables, and should be able to accomodate your full length records.

data want;

array want(48)ind1_2011JAN ind2_2011JAN ind3_2011JAN ind4_2011JAN

ind1_2011FEB ind2_2011FEB ind3_2011FEB ind4_2011FEB

ind1_2011MAR ind2_2011MAR ind3_2011MAR ind4_2011MAR

ind1_2011APR ind2_2011APR ind3_2011APR ind4_2011APR

ind1_2011MAY ind2_2011MAY ind3_2011MAY ind4_2011MAY

ind1_2011JUN ind2_2011JUN ind3_2011JUN ind4_2011JUN

ind1_2011JUL ind2_2011JUL ind3_2011JUL ind4_2011JUL

ind1_2011AUG ind2_2011AUG ind3_2011AUG ind4_2011AUG

ind1_2011SEP ind2_2011SEP ind3_2011SEP ind4_2011SEP

ind1_2011OCT ind2_2011OCT ind3_2011OCT ind4_2011OCT

ind1_2011NOV ind2_2011NOV ind3_2011NOV ind4_2011NOV

ind1_2011DEC ind2_2011DEC ind3_2011DEC ind4_2011DEC;

retain want;

set have;

by id;

if first.id then call missing(of want(*));

want((month(date)-1)*4+1)=ind1;

want((month(date)-1)*4+2)=ind2;

want((month(date)-1)*4+3)=ind3;

want((month(date)-1)*4+4)=ind4;

drop date ind1-ind4;

if last.id then output;

run;

data_null__ · Posted 01-15-2013 06:36 PM

You might consider this technique to create the array.

proc summary nway data=have(keep=date);
   class date;
   output out=udate(drop=_type_ _Freq_);
   run; 
data udate;
   set udate;
   do i = 1 to 4; 
      _name_ = cats('ind',i);
      output; 
      end; 
   run; 
proc sort; 
   by i date;
   run; 
proc transpose data=udate delimiter=_ out=frame(drop=_name_);
   var i;
   id _name_ date;
   run;

Then use data FRAME in your data step with an unexecuted SET. Saves typing the array statement.

   if 0 then set frame;
   array want 
_numeric_;

This is the first two lines of data want.

art297 · Posted 01-15-2013 09:05 PM

DN: Most definitely! I was surprised to discover that the upfront proc summary call ran in around 30% of the time it took the initial proc sql call to run. While this was an SGF quality paper already with just Ksharp's idea, combined with your, Asounding's and my input, it will definitely be a nice addition to the SAS-related literature.

data_null__ · Posted 01-16-2013 10:15 AM

I wonder if there is much or any performance advance by factoring out the MONTH function

if first.id then call missing(of want(*));

m = (month(date)-1)*4;

want(m+1)=ind1;

want(m+2)=ind2;

want(m+3)=ind3;

want(m+4)=ind4;

drop date m ind1-ind4;

if last.id then output;

art297 · Posted 01-16-2013 03:00 PM

DN: Factoring out the month function cut the time in half! However, rather than use your suggestion of proc summary and transpose, I did the same thing using proc sql to create a macro variable. The only difference is with respect to the ordering of the output variables.

: The current version of the code, as shown below, took 75 seconds cpu time and just under 5 minutes real time for 50 million records. I speeded up the proc sql part by limiting it to only analyzing the first 1,000 records. That number would have to be increased if all 12 months don't exist somewhere in the first 1,000 records:

/* Create some test data */

/* Note: Every other record in the dataset is missing values for one month */

data have (drop=months);

format id best32.;

input date ind1-ind4 ;

informat date date9.;

format date yymon7.;

other_variable=2;

do id=1 to 5000000;

date="01dec2010"d;

do months=1 to 12;

date=intnx('month',date,1);

if not(months eq 9 and mod(id,2)) then output;

end;

cards;

01dec2010 1 2 3 4

;

/* Create a macro variable that specifies the desired variable names after transposing */

/* Note: the &junk variable is only created so that proc sql doesn't return a warning */

proc sql noprint;

select distinct 'ind1_'||put(date,yymon7.)||

' ind2_'||put(date,yymon7.)||

' ind3_'||put(date,yymon7.)||

' ind4_'||put(date,yymon7.),

date

into :vars separated by " ",

:junk

from have (obs=1000) /*must be big enough to capture all months*/

order by date

;

quit;

data want;

set have;

by id;

array want(*) &vars.;

retain want;

if first.id then call missing(of want(*));

_n_=month(date);

want((_n_-1)*4+1)=ind1;

want((_n_-1)*4+2)=ind2;

want((_n_-1)*4+3)=ind3;

want((_n_-1)*4+4)=ind4;

drop date ind1-ind4;

if last.id then output;

run;

Peter_C · Posted 01-17-2013 07:18 PM

for large data volumes, some performance might be gained using DOW loop rather than testing if first.id and setting so many to missing

something like

data want;

do until( last.id) ;

set have;

by id;

array want(*) &vars.;

_n_=month(date);

want((_n_-1)*4+1)=ind1;

want((_n_-1)*4+2)=ind2;

want((_n_-1)*4+3)=ind3;

want((_n_-1)*4+4)=ind4;

drop date ind1-ind4;

end;

run;

art297 · Posted 01-17-2013 08:21 PM

Peter,

Quite possibly. I'll try to get around to testing it later tonight.

Art

art297 · Posted 01-17-2013 11:03 PM

: Didn't work. I included the call missing to ensure that the array was clear in the event that a record has some missing data. Using a DOW loop, without the call, didn't appear to clear the array after each ID.

Peter_C · Posted 01-18-2013 03:21 AM

Art

Please show your code. (If you haven't accidentally added a RETAIN to the code I suggested)

Peter.

art297 · Posted 01-29-2013 09:25 PM

: Yes, I had left a retain statement in the code. However, surprisingly (to me at least), your version doesn't appear to run any faster. However, there was a lot of variance between subsequent runs.

Overall, I think your version runs slightly faster, but not consistently. Here is what I compared:

data have;

array var(*) var1-var1000;

do idnum=1 to 100000;

date="01dec2010"d;

do months=3 to 12 by 3;

date=intnx('month',date,3);

do i=1 to 10;

var(i) = ceil( 9*ranuni(123) );

end;

if months eq 6 and mod(idnum,2) then

call missing(var(1));

output;

end;

run;

data work.want;

set have (keep=idnum date var1);

by idnum notsorted;

array want_num(*)

var1_Qtr1

var1_Qtr2

var1_Qtr3

var1_Qtr4

;

array have_num(*) var1;

retain want_num;

if first.idnum then call missing(of want_num(*));

___nnum=put(date,labelfmt.)*dim(have_num);

do ___i=1 to dim(have_num);

want_num(___nnum+___i)=have_num(___i);

end;

drop date ___: var1;

if last.idnum then output;

run;

data work.want2;

do until (last.idnum);

set have (keep=idnum date var1);

by idnum;

array want_num(*)

var1_Qtr1

var1_Qtr2

var1_Qtr3

var1_Qtr4

;

array have_num(*) var1;

___nnum=put(date,labelfmt.)*dim(have_num);

do ___i=1 to dim(have_num);

want_num(___nnum+___i)=have_num(___i);

end;

drop date ___: var1;

if last.idnum then output;

end;

run;

data_null__ · Posted 01-18-2013 07:55 AM

Peter Crawford wrote:


for large data volumes, some performance might be gained using DOW loop rather than testing if first.id and setting so many to missing

something like

data want;
do until( last.id) ;
  set have;
  by id;
  array want(*) &vars.;
  _n_=month(date);
  want((_n_-1)*4+1)=ind1;
  want((_n_-1)*4+2)=ind2;
  want((_n_-1)*4+3)=ind3;
  want((_n_-1)*4+4)=ind4;
  drop date ind1-ind4;
end;
run;

Seems like there might be some small additional performance gain to moving -1)*4 to the assignment of month. _n_=(month(date)-1)*4.

art297 · Posted 01-18-2013 09:05 AM

That will definitely make the code run faster.

SASPhile · Posted 01-29-2013 02:07 PM

Art....there are a lot of changes happening.Havent had a chance to test it,

SASPhile · Posted 01-16-2013 08:31 PM

Art,

There are a lot of changes happening to the existing business logic.I will give you an update once everythjng is finalized,..like how many variabes to include and what variables need processing etc and will run the data step method and let you know about the cpu time.

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Re: Transpose

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away