There are 700 million records. The columns are to be transformed from this format:
id date ind1 ind2 ind3 ind4
18010 Sep 1 2011 0 0 0 0
18010 Oct 1 2011 0 0 0 0
18010 Nov 1 2011 0 0 0 0
18010 Dec 1 2011 1 0 0 0
into this format:
id ind1_sep2011 ind1_oct2011 ind1_nov2011 ind1_dec2011
18010 0 0 0 1
For display purposes I showed only ind1. What is the most efficient way to achieve this format?
Are you sure that is what you want?
I would think that you would want something like:
id ind1 ind2 ind3 ind4
18010 1Dec2011 . . .
If not, aren't you also interested in the other three INDs and are the dates shown the only ones in your data?
There are four indicators, ind1--ind4, and there are 12 months, Sep2011--Sep2012. All the data columns have to be transposed, making 48 columns plus id. Ind1 for Sep2011 will be ind1_sep2011, ind1 for Oct2011 will be ind1_oct2011, ..., through ind4_sep2012.
A double transpose would seem the easiest way.
data have;
  input id date ind1-ind4;
  informat date date9.;
  format date yymon7.;
cards;
18010 01SEP2011 1 2 3 4
18010 01OCT2011 5 6 7 8
18010 01NOV2011 9 1 2 3
18010 01DEC2011 4 5 6 7
;

proc transpose data=have out=tall;
  by id date;
  var ind1-ind4;
run;

proc transpose data=tall out=want delimiter=_;
  by id;
  id _name_ date;
  var col1;
run;
data _null_;
  set; /* with no data set named, SET reads the most recently created data set (WANT) */
  put (_ALL_) (=/);
run;
id=18010
_NAME_=COL1
ind1_2011SEP=1
ind2_2011SEP=2
ind3_2011SEP=3
ind4_2011SEP=4
ind1_2011OCT=5
ind2_2011OCT=6
ind3_2011OCT=7
ind4_2011OCT=8
ind1_2011NOV=9
ind2_2011NOV=1
ind3_2011NOV=2
ind4_2011NOV=3
ind1_2011DEC=4
ind2_2011DEC=5
ind3_2011DEC=6
ind4_2011DEC=7
Given the size of your file, I thought this might be a good time to introduce (and, hopefully, get some feedback on) a concept that KSharp came up with, and that he and I are developing for a future conference paper: using a data step to accomplish such transpose tasks.
Here is an example using a file like yours, although I've only tested it with 1.2 million and 12 million records. With 1.2 million records the data step method runs more than 10 times faster than proc transpose. With 12 million records the difference grew to 15 times faster, and I presume it will continue to increase with even larger files like yours.
Since both this and the proposed code above require that the data be sorted, I didn't include the sort times in the comparison.
data have;
  length id 8;
  input date ind1-ind4;
  informat date date9.;
  format date yymon7.;
  do id=1 to 100000;
    output;
  end;
cards;
01jan2011 1 2 3 4
01feb2011 5 6 7 8
01mar2011 9 1 2 3
01apr2011 4 5 6 7
01may2011 1 2 3 4
01jun2011 5 6 7 8
01jul2011 9 1 2 3
01aug2011 4 5 6 7
01SEP2011 1 2 3 4
01OCT2011 5 6 7 8
01NOV2011 9 1 2 3
01DEC2011 4 5 6 7
;

proc sort data=have;
  by id date;
run;

/*transpose long to wide using data step*/
proc sql;
  create table temp as
    select distinct date from have
  ;
quit;

data _null_;
  set temp end=last;
  if _n_ eq 1 then call execute('data want; merge ');
  call execute(catt('have(where=(date=',date,')',
                    ' rename=( ind1=ind1_',left(put(date,yymon7.)),
                    ' ind2=ind2_',left(put(date,yymon7.)),
                    ' ind3=ind3_',left(put(date,yymon7.)),
                    ' ind4=ind4_',left(put(date,yymon7.)),'))'));
  if last then call execute('; by id; drop date; run;');
run;

/*compare with transpose using proc transpose*/
proc transpose data=have out=tall;
  by id date;
  var ind1-ind4;
run;

proc transpose data=tall out=want delimiter=_;
  by id;
  id _name_ date;
  var col1;
run;
Hi Arthur,
I used your data step for 50 million records and it completed way faster than expected. I could not keep the old log, as I had overwritten it with another process.
Have you tested it with many more dates? Is there a limit to the number of data sets that can be merged?
Of course, you could probably use multiple data steps and still be faster than proc transpose.
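That multi-step idea could be sketched roughly as follows, using the four dates from the first example above (the batch data set names are hypothetical): each data step merges only a subset of the dates, and a final step merges the batch results by id.

data batch1;
  merge have(where=(date='01sep2011'd)
             rename=(ind1=ind1_2011SEP ind2=ind2_2011SEP
                     ind3=ind3_2011SEP ind4=ind4_2011SEP))
        have(where=(date='01oct2011'd)
             rename=(ind1=ind1_2011OCT ind2=ind2_2011OCT
                     ind3=ind3_2011OCT ind4=ind4_2011OCT));
  by id;
  drop date;
run;

data batch2;
  merge have(where=(date='01nov2011'd)
             rename=(ind1=ind1_2011NOV ind2=ind2_2011NOV
                     ind3=ind3_2011NOV ind4=ind4_2011NOV))
        have(where=(date='01dec2011'd)
             rename=(ind1=ind1_2011DEC ind2=ind2_2011DEC
                     ind3=ind3_2011DEC ind4=ind4_2011DEC));
  by id;
  drop date;
run;

data want;
  merge batch1 batch2;
  by id;
run;

With many dates, the same batching could of course be generated with call execute rather than written by hand.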
DN: What would you suggest as "a lot more dates"? Specifically, how large would you suggest?
The OP for this thread only needed 12 dates, so I didn't push it. Interestingly, running proc transpose on just the 4 replications of the 12 dates on 12 million records almost blew out my RAM.
I don't know how many. I was thinking that for the WHEREed MERGE technique the number of MERGE data sets is the limiting factor (or at least I think so), and I would like to know where that limit is.
DN: I've tested it, both using call execute and by creating and running an include program.
I successfully ran 256, 512 and 1024 merges, but the system crashed at 2048. Even with 1024, that created a file 4,097 variables wide, and I wouldn't recommend anyone create a file that wide anyway.
The limitations may also just be due to limited RAM. Can you try the following on your computer and, if it works, increase it to 4096?
data have (drop=months);
  format id best32.;
  input date ind1-ind4;
  informat date date9.;
  format date yymon7.;
  do id=1 to 2;
    date="01dec2010"d;
    do months=1 to 2048;
      date=intnx('month',date,1);
      output;
    end;
  end;
cards;
01dec2010 1 2 3 4
;

proc printto log="c:\testlog";
run;

/*transpose long to wide using data step*/
proc sql;
  create table temp as
    select distinct date from have
  ;
quit;

data _null_;
  set temp end=last;
  if _n_ eq 1 then call execute('data want; merge ');
  call execute(catt('have (where=(date=',date,')',
                    ' rename=( ind1=ind1_',left(put(date,yymon7.)),
                    ' ind2=ind2_',left(put(date,yymon7.)),
                    ' ind3=ind3_',left(put(date,yymon7.)),
                    ' ind4=ind4_',left(put(date,yymon7.)),'))'));
  if last then call execute('; by id; drop date; run;');
run;

proc printto;
run;
Art,
It took over 3 hrs with 50 mil records.
Yes there are extra variables.
Are they the same for all of a subject's 12 records? If not, you will have a problem with any method, as only the last record's info will be kept.
If the values for those variables are the same for all 12 records, you can speed up the process (after sorting the file) as follows: extract the variables to be transposed (i.e., id, date and the four ind variables) to one file; for each last.id record, output just id and the variables that aren't part of the transpose to a second file; run the transpose on the first file; then merge the result with the second file by id.
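A minimal sketch of that split-transpose-merge idea, assuming the extra variables are named extra1 and extra2 (hypothetical names; substitute your own):

/* split the sorted file: slim transpose file plus one record per id
   carrying the variables that are not transposed */
data totranspose(keep=id date ind1-ind4)
     extras(keep=id extra1 extra2);
  set have;
  by id;
  output totranspose;
  if last.id then output extras;
run;

/* transpose only the slim file (double transpose shown here;
   the data step merge method would work the same way) */
proc transpose data=totranspose out=tall;
  by id date;
  var ind1-ind4;
run;

proc transpose data=tall out=wide delimiter=_;
  by id;
  id _name_ date;
  var col1;
run;

/* re-attach the untransposed variables */
data want;
  merge wide extras;
  by id;
run;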
BTW, the call execute solution took 7 minutes (the same in CPU and real time), while the double transpose method took 40.4 minutes of CPU time and over an hour of real time.