Hello everybody,
I have sample code as below,
Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover;
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;
PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
PREFIX=Test;
BY ID;
VAR Variable;
RUN; QUIT;
But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?
Thanks
Just a million?or over a million?
Or millions?
Or 100's of millions or billion?
And is it always sets of 3 for each BY group like what your sample suggests?
The only reason you are sorting is because dataset HAVE is not sorted. If it's a big data set then a single DATA step using a hash object may be faster than SORT then TRANSPOSE:
data template;
set have (keep=id);
length test1-test30 $32;
t=0;
stop;
run;
data _null_;
retain max_t;
if _n_=1 then do;
if 0 then set template;
declare hash h (dataset:'template',ordered:'A');
h.definekey('id');
h.definedata(all:'Y');
h.definedone();
end;
array tst {*} $32 test:;
set have end=end_of_have;
if h.find()=0 then t=t+1;
else t=1;
tst{t}=variable;
h.replace();
max_t=max(max_t,t);
if end_of_have then do;
want_descriptor=cats('WANT (keep=ID test1-test',max_t,')');
rc=h.output(dataset:want_descriptor);
end;
run;
The value here is to save on disk input/output, at the expense of memory for the hash object.
Thank you for your invaluable response. It seems work, however, I am not familiar with Hash code, can you please explain more deeply(step by step maybe)? @mkeintz @novinosrin On the other hand, our data sets 50 million rows. On the other hand, are there any alternative methods except from Hash?
Thanks
Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover;
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;
proc sql noprint;
create index id on have;
select max(n) into : n
from (select count(*) as n from have group by id);
quit;
data want;
do i=1 by 1 until(last.id);
set have;
by id;
array x{*} $ 80 test1-test%left(&n);
x{i}=variable;
end;
drop i variable;
run;
@ertr wrote:
Hello everybody,
I have sample code as below,
Data Have; Length ID 8 Variable $ 32; Infile Datalines Missover; Input ID Variable; Datalines; 1 AAA 1 BBB 1 CCC 2 AAA 2 BBB 2 CCC ; PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ; BY ID; RUN; PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_) PREFIX=Test; BY ID; VAR Variable; RUN; QUIT;
But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?
Thanks
Your example data implies that the values of variable may repeat for all your ID values (though possibly in a different order). Is that actually the case? If so selecting the distinct values of the variable, transposing and join to the id list might be another approach.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.