Hello everyone, I am new to SAS and data management in general so I would really appreciate any help!
I have a dataset with multiple rows per subject and multiple columns of diagnoses codes for each row. For each subject i want to set any duplicate diagnoses as missing. The problem is that the duplicate diagnoses are in different rows. So I have:
Subject D1 D2 D3 D4 D5
1 a b c d e
1 a d . . .
1 f a b . .
2 g c h e d
2 e d j . .
and want:
Subject D1 D2 D3 D4 D5
1 a b c d e
1 . . . . .
1 f . . . .
2 g c h e d
2 . . j . .
I can transpose from long to wide format and then use an array to compare among codes, set duplicates as missing and transpose back to long. But this is inefficient however as in the wide format i will end up with more than 2000 columns. Is there a more efficient way?
Thank you in advance!
data have;
input (Subject D1 D2 D3 D4 D5 ) ($);
cards;
1 a b c d e
1 a d . . .
1 f a b . .
2 g c h e d
2 e d j . .
;
data want;
if _n_ = 1 then do ;
dcl hash h ( multidata:"y") ;
h.definekey ("subject",'d') ;
h.definedata ("subject",'d') ;
h.definedone();
end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.add();
if rc=0 then call missing(t(n));
end;
drop rc n d;
run;
data have;
input (Subject D1 D2 D3 D4 D5 ) ($);
cards;
1 a b c d e
1 a d . . .
1 f a b . .
2 g c h e d
2 e d j . .
;
data want;
if _n_ = 1 then do ;
dcl hash h ( multidata:"y") ;
h.definekey ("subject",'d') ;
h.definedata ("subject",'d') ;
h.definedone();
end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.add();
if rc=0 then call missing(t(n));
end;
drop rc n d;
run;
Thank you so much! Worked like a charm and it is so efficient!
Hash can be useful, but for this nail, I don't think you need that big a hammer. Just keep a retained space-separated list of d values (variables _d_list) for a given id. Then check subsequent d values against that list, with the result of either expanding the list with a new d value, or setting the d value to missing.
data have;
input (Subject D1 D2 D3 D4 D5 ) ($);
cards;
1 a b c d e
1 a d . . .
1 f a b . .
2 g c h e d
2 e d j . .
;
data want;
set have;
by subject;
length _d_list $300;
retain _d_list;
array d {*} d1-d5;
if first.subject then _d_list=catx(' ',of d{*});
else do _i=1 to dim(d) while (d{_i}^=' ');
if findw(trim(_d_list),trim(d{_i}))=0 then _d_list=catx(' ',_d_list,d{_i});
else d{_i}=' ';
end;
drop _: ;
run;
This program assumes (1) the data are sorted by subject, (2) the first record for each subject has no duplicate d-values, (3) that obeservations with less the 5 non-missing d-values have all the non-missing at the left end of the d-variables.
Make sure the length of _d_list is sufficient to hold the maximum number of space-separated distinct d values.
Easiest would probably be to just transpose vertically and deduplicate with PROC SORT by subject and diagnosis. But if you want to keep the original structure, maybe this:
proc sort data=have; by subject; run;
proc sql noprint;
select max(n)*5 into :maxD from
(select subject, count(*) as n
from have group by subject);
quit;
data have;
set have;
by subject;
array U {&maxD} $10 _temporary_;
array d {*} D1-D5;
if first.subject then do;
do i=1 to dim(U);
U[i]='';
end;
n=0;
end;
do i=1 to dim(d);
if not missing(d[i]) then do;
if d[i] in U then d[i]='';
else do;
n+1;
U[n]=d[i];
end;
end;
end;
run;
/*a little cleaner than previous*/
data want;
if _n_ = 1 then do ;
declare hash h ( ) ;*further edited;
h.definekey ("subject",'d') ;
h.definedata ("subject",'d') ;
h.definedone();
end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.replace(); *further edited;
if rc=0 then call missing(t(n));
end;
drop rc n d;
run;
Thanks so much everyone for the help, I made it work clean and fast!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.