BookmarkSubscribeRSS Feed
SASPhile
Quartz | Level 8

There are 700 million records. The columns are to transformed to this way:
id           date                         ind1   ind2   ind3   ind4
18010    Sep  1 2011                   0      0       0      0
18010   Oct  1 2011                       0       0       0     0
18010   Nov  1 2011                      0      0       0      0
18010   Dec  1 2011                       1      0       0      0


id               ind1_sep2011     ind1_oct2011   ind1_nov2011   ind1_dec2011  
18010             0                           0                          0                    1

For display purposes I just showed only ind1.What is the most efficient way to achive this format?

30 REPLIES 30
art297
Opal | Level 21

Are you sure that is what you want?

I would think that you would want something like:

id            ind1       ind2      ind3        ind4

18010 1Dec2011     .           .             .

If not, aren't you also interested in the other three INDs and are the dates shown the only ones in your data?

SASPhile
Quartz | Level 8

There are fout indicators ind1--ind4 and there are 12 months.Sep2011--sep2012.All the dat columns have to be transposed thus making 48 columns plus id. Ind1 for sep2011 will be Ind_sep2011 ,ind1 for oct2011 will be ind1_oct2011 ...ind4_sep2012.

Tom
Super User Tom
Super User

Double transpose would seem the easiest way.

data have ;

  input id date ind1-ind4 ;

  informat date date9.;

  format date yymon7.;

cards;

18010 01SEP2011 1 2 3 4

18010 01OCT2011 5 6 7 8

18010 01NOV2011 9 1 2 3

18010 01DEC2011 4 5 6 7

;

proc transpose data=have out=tall ;

by id date;

var ind1-ind4;

run;

proc transpose data=tall out=want delimiter=_;

by id;

id _name_ date;

var col1;

run;

data _null_; set;

put (_ALL_) (=/);

run;

id=18010

_NAME_=COL1

ind1_2011SEP=1

ind2_2011SEP=2

ind3_2011SEP=3

ind4_2011SEP=4

ind1_2011OCT=5

ind2_2011OCT=6

ind3_2011OCT=7

ind4_2011OCT=8

ind1_2011NOV=9

ind2_2011NOV=1

ind3_2011NOV=2

ind4_2011NOV=3

ind1_2011DEC=4

ind2_2011DEC=5

ind3_2011DEC=6

ind4_2011DEC=7

art297
Opal | Level 21

: Given the size of your file, I thought this might be a good time to introduce (and, hopefully, get some feedback regarding) a concept that KSharp came up with and he and I are formulating for a future conference paper, namely using a data step to accomplish such transpose tasks.

Here is an example using a file like yours, although I've only tested it with 1.2 million and 12 million records.  With 1.2 million records the data step method runs more than 10 times faster than using proc transpose.  With 12 million records, the difference amplified to 15 times faster and I presume that will continue to increase with even larger files like yours.

Since both this and 's proposed code require that the data be sorted, I didn't include those times in the comparison.

data have ;

  length id 8.;

  input date ind1-ind4 ;

  informat date date9.;

  format date yymon7.;

  do id=1 to 100000;

    output;

  end;

cards;

01jan2011 1 2 3 4

01feb2011 5 6 7 8

01mar2011 9 1 2 3

01apr2011 4 5 6 7

01may2011 1 2 3 4

01jun2011 5 6 7 8

01jul2011 9 1 2 3

01aug2011 4 5 6 7

01SEP2011 1 2 3 4

01OCT2011 5 6 7 8

01NOV2011 9 1 2 3

01DEC2011 4 5 6 7

;

proc sort data=have;

  by id date;

run;

/*transpose long to wide using data step*/

proc sql;

  create table temp as

    select distinct date from have

  ;

quit;

data _null_;

  set temp end=last;

  if _n_ eq 1 then call execute('data want; merge ');

  call execute(catt('have(where=(date=',date,')',

      ' rename=( ind1=ind1_',left(put(date,yymon7.)),

               ' ind2=ind2_',left(put(date,yymon7.)),

               ' ind3=ind3_',left(put(date,yymon7.)),

               ' ind4=ind4_',left(put(date,yymon7.)),'))'));

  if last then call execute('; by id; drop date; run;');

run;

/*compare with transpose using proc transpose*/

proc transpose data=have out=tall ;

by id date;

var ind1-ind4;

run;

proc transpose data=tall out=want delimiter=_;

by id;

id _name_ date;

var col1;

run;

SASPhile
Quartz | Level 8

Hi Arthur,

I used your data step for 50 Mil records and it completed way faster than expected. I could not keep the old Log as I had overwriiten with other process.

art297
Opal | Level 21

: I'm currently testing out DN's suggestion of testing some of the limits, but quickly confronted the log situation.  That is easy to circumvent by just rerouting the log.  I wrapped the code with:

proc printto log="c:\art\testlog";

run;

/* the code */

proc printto;

run;

data_null__
Jade | Level 19

Have you tested when you have lots more dates?  Is there a limit to the number of data sets that can be merged.

Or course you could probabley use multiple data steps and still be faster than proc transpose.

art297
Opal | Level 21

dn: what would you suggest as "a lot more dates", specifically, how large would you suggest?

The OP for this thread only needed 12 dates, so I didn't push it.  Interestingly, using proc transpose on just the 4 replications of the 12 dates on 12 million records almost blew out my ram.

data_null__
Jade | Level 19

I don't know how many.  I was thinking that for the WHEREed MERGE technique that the number of MERGE data sets is the limiting factor (or at least I think so) and I would like to know where that limit is.

art297
Opal | Level 21

DN: I've tested it, both with using call execute and with creating and running an include program.

I successfully ran 256, 512 and 1024 merges, but the system crashed as 2048.  But, even with 1024, that was creating a file that was 4,097  variables wide.  I wouldn't recommend anyone creating that wide of a file anyway.

The limitations, also, may just be due to limited ram.  Can you try the following on your computer and, if it works, increasing it to 4096?:

data have (drop=months);

  format id best32.;

  input date ind1-ind4 ;

  informat date date9.;

  format date yymon7.;

  do id=1 to 2;

    date="01dec2010"d;

    do months=1 to 2048;

      date=intnx('month',date,1);

      output;

    end;

  end;

  cards;

01dec2010 1 2 3 4

;

proc printto log="c:\testlog";

run;

/*transpose long to wide using data step*/

proc sql;

  create table temp as

    select distinct date from have

  ;

quit;

data _null_;

  set temp end=last;

  if _n_ eq 1 then call execute('data want; merge ');

  call execute(catt('have (where=(date=',date,')',

      ' rename=( ind1=ind1_',left(put(date,yymon7.)),

               ' ind2=ind2_',left(put(date,yymon7.)),

               ' ind3=ind3_',left(put(date,yymon7.)),

               ' ind4=ind4_',left(put(date,yymon7.)),'))'));

  if last then call execute('; by id; drop date; run;');

run;

proc printto;

run;

SASPhile
Quartz | Level 8

Art,

It took over 3 hrs with 50 mil records.

art297
Opal | Level 21

: Then I would guess that your input file has quite a few extra variables in it or that there is something else going on.  I tested the process on 50 million records, on an extremely old computer with extremely limited ram, and it only took 7 minutes.

SASPhile
Quartz | Level 8

Yes there are extra variables.

art297
Opal | Level 21

: Are they the same for all of a subject's 12 records?  If not, you will have a problem with any method as only the last record's info will be kept.

If the values for those variables are the same for all 12 records, you can speed up the process by (after sorting the file) extracting the data to be transposed (i.e., id, date and the four ind variables) to one file and, for the last.id, outputting just id and the variables that aren't part of the transpose, running the transpose on the first of those files, and then merging the results with the second one.

BTW, the 7 minute proc execute solution (which took the same cpu and real time) took 40.4 minutes cpu time and over an hour real time using the double transpose method.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 30 replies
  • 1593 views
  • 2 likes
  • 6 in conversation