DATA Step, Macro, Functions and more

Transpose

Reply
Super Contributor
Posts: 647

Transpose

There are 700 million records. The columns are to transformed to this way:
id           date                         ind1   ind2   ind3   ind4
18010    Sep  1 2011                   0      0       0      0
18010   Oct  1 2011                       0       0       0     0
18010   Nov  1 2011                      0      0       0      0
18010   Dec  1 2011                       1      0       0      0


id               ind1_sep2011     ind1_oct2011   ind1_nov2011   ind1_dec2011  
18010             0                           0                          0                    1

For display purposes I just showed only ind1.What is the most efficient way to achive this format?

PROC Star
Posts: 7,356

Re: Transpose

Are you sure that is what you want?

I would think that you would want something like:

id            ind1       ind2      ind3        ind4

18010 1Dec2011     .           .             .

If not, aren't you also interested in the other three INDs and are the dates shown the only ones in your data?

Super Contributor
Posts: 647

Re: Transpose

There are fout indicators ind1--ind4 and there are 12 months.Sep2011--sep2012.All the dat columns have to be transposed thus making 48 columns plus id. Ind1 for sep2011 will be Ind_sep2011 ,ind1 for oct2011 will be ind1_oct2011 ...ind4_sep2012.

Super User
Super User
Posts: 6,495

Re: Transpose

Double transpose would seem the easiest way.

data have ;

  input id date ind1-ind4 ;

  informat date date9.;

  format date yymon7.;

cards;

18010 01SEP2011 1 2 3 4

18010 01OCT2011 5 6 7 8

18010 01NOV2011 9 1 2 3

18010 01DEC2011 4 5 6 7

;

proc transpose data=have out=tall ;

by id date;

var ind1-ind4;

run;

proc transpose data=tall out=want delimiter=_;

by id;

id _name_ date;

var col1;

run;

data _null_; set;

put (_ALL_) (=/);

run;

id=18010

_NAME_=COL1

ind1_2011SEP=1

ind2_2011SEP=2

ind3_2011SEP=3

ind4_2011SEP=4

ind1_2011OCT=5

ind2_2011OCT=6

ind3_2011OCT=7

ind4_2011OCT=8

ind1_2011NOV=9

ind2_2011NOV=1

ind3_2011NOV=2

ind4_2011NOV=3

ind1_2011DEC=4

ind2_2011DEC=5

ind3_2011DEC=6

ind4_2011DEC=7

PROC Star
Posts: 7,356

Re: Transpose

: Given the size of your file, I thought this might be a good time to introduce (and, hopefully, get some feedback regarding) a concept that KSharp came up with and he and I are formulating for a future conference paper, namely using a data step to accomplish such transpose tasks.

Here is an example using a file like yours, although I've only tested it with 1.2 million and 12 million records.  With 1.2 million records the data step method runs more than 10 times faster than using proc transpose.  With 12 million records, the difference amplified to 15 times faster and I presume that will continue to increase with even larger files like yours.

Since both this and 's proposed code require that the data be sorted, I didn't include those times in the comparison.

data have ;

  length id 8.;

  input date ind1-ind4 ;

  informat date date9.;

  format date yymon7.;

  do id=1 to 100000;

    output;

  end;

cards;

01jan2011 1 2 3 4

01feb2011 5 6 7 8

01mar2011 9 1 2 3

01apr2011 4 5 6 7

01may2011 1 2 3 4

01jun2011 5 6 7 8

01jul2011 9 1 2 3

01aug2011 4 5 6 7

01SEP2011 1 2 3 4

01OCT2011 5 6 7 8

01NOV2011 9 1 2 3

01DEC2011 4 5 6 7

;

proc sort data=have;

  by id date;

run;

/*transpose long to wide using data step*/

proc sql;

  create table temp as

    select distinct date from have

  ;

quit;

data _null_;

  set temp end=last;

  if _n_ eq 1 then call execute('data want; merge ');

  call execute(catt('have(where=(date=',date,')',

      ' rename=( ind1=ind1_',left(put(date,yymon7.)),

               ' ind2=ind2_',left(put(date,yymon7.)),

               ' ind3=ind3_',left(put(date,yymon7.)),

               ' ind4=ind4_',left(put(date,yymon7.)),'))'));

  if last then call execute('; by id; drop date; run;');

run;

/*compare with transpose using proc transpose*/

proc transpose data=have out=tall ;

by id date;

var ind1-ind4;

run;

proc transpose data=tall out=want delimiter=_;

by id;

id _name_ date;

var col1;

run;

Super Contributor
Posts: 647

Re: Transpose

Hi Arthur,

I used your data step for 50 Mil records and it completed way faster than expected. I could not keep the old Log as I had overwriiten with other process.

PROC Star
Posts: 7,356

Re: Transpose

: I'm currently testing out DN's suggestion of testing some of the limits, but quickly confronted the log situation.  That is easy to circumvent by just rerouting the log.  I wrapped the code with:

proc printto log="c:\art\testlog";

run;

/* the code */

proc printto;

run;

Respected Advisor
Posts: 3,777

Re: Transpose

Have you tested when you have lots more dates?  Is there a limit to the number of data sets that can be merged.

Or course you could probabley use multiple data steps and still be faster than proc transpose.

PROC Star
Posts: 7,356

Re: Transpose

dn: what would you suggest as "a lot more dates", specifically, how large would you suggest?

The OP for this thread only needed 12 dates, so I didn't push it.  Interestingly, using proc transpose on just the 4 replications of the 12 dates on 12 million records almost blew out my ram.

Respected Advisor
Posts: 3,777

Re: Transpose

I don't know how many.  I was thinking that for the WHEREed MERGE technique that the number of MERGE data sets is the limiting factor (or at least I think so) and I would like to know where that limit is.

PROC Star
Posts: 7,356

Re: Transpose

DN: I've tested it, both with using call execute and with creating and running an include program.

I successfully ran 256, 512 and 1024 merges, but the system crashed as 2048.  But, even with 1024, that was creating a file that was 4,097  variables wide.  I wouldn't recommend anyone creating that wide of a file anyway.

The limitations, also, may just be due to limited ram.  Can you try the following on your computer and, if it works, increasing it to 4096?:

data have (drop=months);

  format id best32.;

  input date ind1-ind4 ;

  informat date date9.;

  format date yymon7.;

  do id=1 to 2;

    date="01dec2010"d;

    do months=1 to 2048;

      date=intnx('month',date,1);

      output;

    end;

  end;

  cards;

01dec2010 1 2 3 4

;

proc printto log="c:\testlog";

run;

/*transpose long to wide using data step*/

proc sql;

  create table temp as

    select distinct date from have

  ;

quit;

data _null_;

  set temp end=last;

  if _n_ eq 1 then call execute('data want; merge ');

  call execute(catt('have (where=(date=',date,')',

      ' rename=( ind1=ind1_',left(put(date,yymon7.)),

               ' ind2=ind2_',left(put(date,yymon7.)),

               ' ind3=ind3_',left(put(date,yymon7.)),

               ' ind4=ind4_',left(put(date,yymon7.)),'))'));

  if last then call execute('; by id; drop date; run;');

run;

proc printto;

run;

Super Contributor
Posts: 647

Re: Transpose

Art,

It took over 3 hrs with 50 mil records.

PROC Star
Posts: 7,356

Re: Transpose

: Then I would guess that your input file has quite a few extra variables in it or that there is something else going on.  I tested the process on 50 million records, on an extremely old computer with extremely limited ram, and it only took 7 minutes.

Super Contributor
Posts: 647

Re: Transpose

Yes there are extra variables.

PROC Star
Posts: 7,356

Re: Transpose

: Are they the same for all of a subject's 12 records?  If not, you will have a problem with any method as only the last record's info will be kept.

If the values for those variables are the same for all 12 records, you can speed up the process by (after sorting the file) extracting the data to be transposed (i.e., id, date and the four ind variables) to one file and, for the last.id, outputting just id and the variables that aren't part of the transpose, running the transpose on the first of those files, and then merging the results with the second one.

BTW, the 7 minute proc execute solution (which took the same cpu and real time) took 40.4 minutes cpu time and over an hour real time using the double transpose method.

Ask a Question
Discussion stats
  • 30 replies
  • 598 views
  • 2 likes
  • 6 in conversation