There are 700 million records. The columns are to be transformed from this format:
id date ind1 ind2 ind3 ind4
18010 Sep 1 2011 0 0 0 0
18010 Oct 1 2011 0 0 0 0
18010 Nov 1 2011 0 0 0 0
18010 Dec 1 2011 1 0 0 0
into this format:
id ind1_sep2011 ind1_oct2011 ind1_nov2011 ind1_dec2011
18010 0 0 0 1
For display purposes I showed only ind1. What is the most efficient way to achieve this format?
Are you sure that is what you want?
I would think that you would want something like:
id ind1 ind2 ind3 ind4
18010 1Dec2011 . . .
If not, aren't you also interested in the other three INDs and are the dates shown the only ones in your data?
There are four indicators, ind1--ind4, and there are 12 months, Sep2011--Sep2012. All the data columns have to be transposed, making 48 columns plus id. Ind1 for Sep2011 will be ind1_sep2011, ind1 for Oct2011 will be ind1_oct2011, ..., through ind4_sep2012.
A double transpose would seem the easiest way.
data have;
  input id date ind1-ind4;
  informat date date9.;
  format date yymon7.;
cards;
18010 01SEP2011 1 2 3 4
18010 01OCT2011 5 6 7 8
18010 01NOV2011 9 1 2 3
18010 01DEC2011 4 5 6 7
;

proc transpose data=have out=tall;
  by id date;
  var ind1-ind4;
run;

proc transpose data=tall out=want delimiter=_;
  by id;
  id _name_ date;
  var col1;
run;
data _null_;
  set; /* with no data set named, SET reads the most recently created data set (WANT) */
  put (_ALL_) (=/);
run;
id=18010
_NAME_=COL1
ind1_2011SEP=1
ind2_2011SEP=2
ind3_2011SEP=3
ind4_2011SEP=4
ind1_2011OCT=5
ind2_2011OCT=6
ind3_2011OCT=7
ind4_2011OCT=8
ind1_2011NOV=9
ind2_2011NOV=1
ind3_2011NOV=2
ind4_2011NOV=3
ind1_2011DEC=4
ind2_2011DEC=5
ind3_2011DEC=6
ind4_2011DEC=7
Given the size of your file, I thought this might be a good time to introduce (and, hopefully, get some feedback on) a concept that KSharp came up with, and that he and I are developing for a future conference paper: using a data step to accomplish such transpose tasks.
Here is an example using a file like yours, although I've only tested it with 1.2 million and 12 million records. With 1.2 million records the data step method runs more than 10 times faster than proc transpose. With 12 million records the difference grew to 15 times faster, and I presume it will continue to increase with even larger files like yours.
Since both this and the proposed code above require that the data be sorted, I didn't include the sort times in the comparison.
data have;
  length id 8;
  input date ind1-ind4;
  informat date date9.;
  format date yymon7.;
  do id=1 to 100000;
    output;
  end;
cards;
01jan2011 1 2 3 4
01feb2011 5 6 7 8
01mar2011 9 1 2 3
01apr2011 4 5 6 7
01may2011 1 2 3 4
01jun2011 5 6 7 8
01jul2011 9 1 2 3
01aug2011 4 5 6 7
01SEP2011 1 2 3 4
01OCT2011 5 6 7 8
01NOV2011 9 1 2 3
01DEC2011 4 5 6 7
;

proc sort data=have;
  by id date;
run;

/*transpose long to wide using data step*/
proc sql;
  create table temp as
    select distinct date from have
  ;
quit;

data _null_;
  set temp end=last;
  if _n_ eq 1 then call execute('data want; merge ');
  call execute(catt('have(where=(date=',date,')',
                    ' rename=( ind1=ind1_',left(put(date,yymon7.)),
                    ' ind2=ind2_',left(put(date,yymon7.)),
                    ' ind3=ind3_',left(put(date,yymon7.)),
                    ' ind4=ind4_',left(put(date,yymon7.)),'))'));
  if last then call execute('; by id; drop date; run;');
run;

/*compare with transpose using proc transpose*/
proc transpose data=have out=tall;
  by id date;
  var ind1-ind4;
run;

proc transpose data=tall out=want delimiter=_;
  by id;
  id _name_ date;
  var col1;
run;
Hi Arthur,
I used your data step for 50 million records and it completed way faster than expected. I could not keep the old log, as I had overwritten it with another process.
Have you tested it with many more dates? Is there a limit to the number of data sets that can be merged?
Of course, you could probably use multiple data steps and still be faster than proc transpose.
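That multi-step idea could be sketched roughly as follows, using the four dates from the first example above (the batch data set names are hypothetical): each data step merges only a subset of the dates, and a final step merges the batch results by id.

data batch1;
  merge have(where=(date='01sep2011'd)
             rename=(ind1=ind1_2011SEP ind2=ind2_2011SEP
                     ind3=ind3_2011SEP ind4=ind4_2011SEP))
        have(where=(date='01oct2011'd)
             rename=(ind1=ind1_2011OCT ind2=ind2_2011OCT
                     ind3=ind3_2011OCT ind4=ind4_2011OCT));
  by id;
  drop date;
run;

data batch2;
  merge have(where=(date='01nov2011'd)
             rename=(ind1=ind1_2011NOV ind2=ind2_2011NOV
                     ind3=ind3_2011NOV ind4=ind4_2011NOV))
        have(where=(date='01dec2011'd)
             rename=(ind1=ind1_2011DEC ind2=ind2_2011DEC
                     ind3=ind3_2011DEC ind4=ind4_2011DEC));
  by id;
  drop date;
run;

data want;
  merge batch1 batch2;
  by id;
run;

With many dates, the same batching could of course be generated with call execute rather than written by hand.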
DN: What would you suggest as "a lot more dates"? Specifically, how large would you suggest?
The OP for this thread only needed 12 dates, so I didn't push it. Interestingly, running proc transpose on just the 4 replications of the 12 dates on 12 million records almost blew out my RAM.
I don't know how many. I was thinking that for the WHEREed MERGE technique the number of MERGE data sets is the limiting factor (or at least I think so), and I would like to know where that limit is.
DN: I've tested it, both using call execute and by creating and running an include program.
I successfully ran 256, 512 and 1024 merges, but the system crashed at 2048. Even with 1024, that created a file 4,097 variables wide, and I wouldn't recommend anyone create a file that wide anyway.
The limitations may also just be due to limited RAM. Can you try the following on your computer and, if it works, increase it to 4096?
data have (drop=months);
  format id best32.;
  input date ind1-ind4;
  informat date date9.;
  format date yymon7.;
  do id=1 to 2;
    date="01dec2010"d;
    do months=1 to 2048;
      date=intnx('month',date,1);
      output;
    end;
  end;
cards;
01dec2010 1 2 3 4
;

proc printto log="c:\testlog";
run;

/*transpose long to wide using data step*/
proc sql;
  create table temp as
    select distinct date from have
  ;
quit;

data _null_;
  set temp end=last;
  if _n_ eq 1 then call execute('data want; merge ');
  call execute(catt('have (where=(date=',date,')',
                    ' rename=( ind1=ind1_',left(put(date,yymon7.)),
                    ' ind2=ind2_',left(put(date,yymon7.)),
                    ' ind3=ind3_',left(put(date,yymon7.)),
                    ' ind4=ind4_',left(put(date,yymon7.)),'))'));
  if last then call execute('; by id; drop date; run;');
run;

proc printto;
run;
Art,
It took over 3 hrs with 50 mil records.
Yes there are extra variables.
Are they the same for all of a subject's 12 records? If not, you will have a problem with any method, as only the last record's info will be kept.
If the values for those variables are the same for all 12 records, you can speed up the process (after sorting the file) as follows: extract the variables to be transposed (i.e., id, date and the four ind variables) to one file; for each last.id record, output just id and the variables that aren't part of the transpose to a second file; run the transpose on the first file; then merge the result with the second file by id.
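A minimal sketch of that split-transpose-merge idea, assuming the extra variables are named extra1 and extra2 (hypothetical names; substitute your own):

/* split the sorted file: slim transpose file plus one record per id
   carrying the variables that are not transposed */
data totranspose(keep=id date ind1-ind4)
     extras(keep=id extra1 extra2);
  set have;
  by id;
  output totranspose;
  if last.id then output extras;
run;

/* transpose only the slim file (double transpose shown here;
   the data step merge method would work the same way) */
proc transpose data=totranspose out=tall;
  by id date;
  var ind1-ind4;
run;

proc transpose data=tall out=wide delimiter=_;
  by id;
  id _name_ date;
  var col1;
run;

/* re-attach the untransposed variables */
data want;
  merge wide extras;
  by id;
run;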
BTW, the call execute solution took 7 minutes (the same in CPU and real time), while the double transpose method took 40.4 minutes of CPU time and over an hour of real time.