BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ameriel
Fluorite | Level 6

Hello everyone, I am new to SAS and data management in general so I would really appreciate any help! 

 

I have a dataset with multiple rows per subject and multiple columns of diagnoses codes for each row. For each subject i want to set any duplicate diagnoses as missing. The problem is that the duplicate diagnoses are in different rows. So I have:

 

Subject    D1    D2   D3   D4   D5 

1               a      b     c       d      e

1               a      d      .      .       .

1               f      a     b       .       .

2               g      c    h        e      d

2               e      d      j       .         .

 

and  want:

 

Subject    D1    D2   D3   D4   D5 

1               a      b     c       d      e

1               .      .      .      .       .

1               f      .      .      .       .

2               g      c    h        e      d

2               .      .      j       .         .

 

I can transpose from long to wide format and then use an array to compare among codes, set duplicates as missing and transpose back to long. But this is inefficient however as in the wide format i will end up with more than 2000 columns. Is there a more efficient way?  

 

Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20
data have;
input (Subject    D1    D2   D3   D4   D5 ) ($);
cards;
1               a      b     c       d      e
1               a      d      .      .       .
1               f      a     b       .       .
2               g      c    h        e      d
2               e      d      j       .         .
;


data want;
 if _n_ = 1 then do ;                          
    dcl hash h ( multidata:"y") ;
    h.definekey ("subject",'d') ;                   
    h.definedata ("subject",'d') ; 
	h.definedone(); 
	end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.add(); 
if rc=0  then call missing(t(n));
end;
drop rc n d;
run;

View solution in original post

6 REPLIES 6
novinosrin
Tourmaline | Level 20
data have;
input (Subject    D1    D2   D3   D4   D5 ) ($);
cards;
1               a      b     c       d      e
1               a      d      .      .       .
1               f      a     b       .       .
2               g      c    h        e      d
2               e      d      j       .         .
;


data want;
 if _n_ = 1 then do ;                          
    dcl hash h ( multidata:"y") ;
    h.definekey ("subject",'d') ;                   
    h.definedata ("subject",'d') ; 
	h.definedone(); 
	end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.add(); 
if rc=0  then call missing(t(n));
end;
drop rc n d;
run;
ameriel
Fluorite | Level 6

Thank you so much! Worked like a charm and it is so efficient! 

mkeintz
PROC Star

Hash can be useful, but for this nail, I don't think you need that big a hammer.    Just keep a retained space-separated list of d values (variables _d_list) for a given id.  Then check subsequent d values against that list, with the result of either expanding the list with a new d value, or setting the d value to missing.

 

data have;
input (Subject    D1    D2   D3   D4   D5 ) ($);
cards;
1               a      b     c       d      e
1               a      d      .      .       .
1               f      a     b       .       .
2               g      c    h        e      d
2               e      d      j       .         .
;

data want;
  set have;
  by subject;
  length _d_list $300;
  retain _d_list;
  array d {*} d1-d5;
  if first.subject then _d_list=catx(' ',of d{*});
  else do _i=1 to dim(d) while (d{_i}^=' ');
    if findw(trim(_d_list),trim(d{_i}))=0 then _d_list=catx(' ',_d_list,d{_i});
    else d{_i}=' ';
  end;
  drop _: ;
run;

 

This program assumes (1) the data are sorted by subject, (2) the first record for each subject has no duplicate d-values, (3) that obeservations with less the 5 non-missing d-values have all the non-missing at the left end of the d-variables.

 

Make sure the length of _d_list is sufficient to hold the maximum number of space-separated distinct d values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
quickbluefish
Obsidian | Level 7

Easiest would probably be to just transpose vertically and deduplicate with PROC SORT by subject and diagnosis.  But if you want to keep the original structure, maybe this:

 

proc sort data=have; by subject; run;

proc sql noprint;
select max(n)*5 into :maxD from	
	(select subject, count(*) as n 
	from have group by subject);
quit;

data have;
set have;
by subject;
array U {&maxD} $10 _temporary_;
array d {*} D1-D5;
if first.subject then do;
	do i=1 to dim(U);
		U[i]='';
	end;
	n=0;
end;
do i=1 to dim(d);
	if not missing(d[i]) then do;
		if d[i] in U then d[i]='';
		else do;
			n+1;
			U[n]=d[i];
		end;
	end;
end;
run;
novinosrin
Tourmaline | Level 20
/*a little cleaner than previous*/
data want;
 if _n_ = 1 then do ;                          
    declare hash h ( ) ;*further edited;
    h.definekey ("subject",'d') ;                   
    h.definedata ("subject",'d') ; 
	h.definedone(); 
	end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.replace();  *further edited;
if rc=0  then call missing(t(n));
end;
drop rc n d;
run;

 

ameriel
Fluorite | Level 6

Thanks so much everyone for the help, I made it work clean and fast! 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 2502 views
  • 1 like
  • 4 in conversation