Solved: Re: Comparing values accross rows and columns

ameriel · Posted 07-27-2018 07:14 PM

Hello everyone, I am new to SAS and data management in general so I would really appreciate any help!

I have a dataset with multiple rows per subject and multiple columns of diagnoses codes for each row. For each subject i want to set any duplicate diagnoses as missing. The problem is that the duplicate diagnoses are in different rows. So I have:

Subject D1 D2 D3 D4 D5

1 a b c d e

1 a d . . .

1 f a b . .

2 g c h e d

2 e d j . .

and want:

Subject D1 D2 D3 D4 D5

1 a b c d e

1 . . . . .

1 f . . . .

2 g c h e d

2 . . j . .

I can transpose from long to wide format and then use an array to compare among codes, set duplicates as missing and transpose back to long. But this is inefficient however as in the wide format i will end up with more than 2000 columns. Is there a more efficient way?

Thank you in advance!

novinosrin · Posted 07-27-2018 08:33 PM

data have;
input (Subject    D1    D2   D3   D4   D5 ) ($);
cards;
1               a      b     c       d      e
1               a      d      .      .       .
1               f      a     b       .       .
2               g      c    h        e      d
2               e      d      j       .         .
;


data want;
 if _n_ = 1 then do ;                          
    dcl hash h ( multidata:"y") ;
    h.definekey ("subject",'d') ;                   
    h.definedata ("subject",'d') ; 
	h.definedone(); 
	end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.add(); 
if rc=0  then call missing(t(n));
end;
drop rc n d;
run;

View solution in original post

novinosrin · Posted 07-27-2018 08:33 PM

data have;
input (Subject    D1    D2   D3   D4   D5 ) ($);
cards;
1               a      b     c       d      e
1               a      d      .      .       .
1               f      a     b       .       .
2               g      c    h        e      d
2               e      d      j       .         .
;


data want;
 if _n_ = 1 then do ;                          
    dcl hash h ( multidata:"y") ;
    h.definekey ("subject",'d') ;                   
    h.definedata ("subject",'d') ; 
	h.definedone(); 
	end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.add(); 
if rc=0  then call missing(t(n));
end;
drop rc n d;
run;

ameriel · Posted 07-27-2018 09:07 PM

Thank you so much! Worked like a charm and it is so efficient!

mkeintz · Posted 07-27-2018 09:12 PM

Hash can be useful, but for this nail, I don't think you need that big a hammer. Just keep a retained space-separated list of d values (variables _d_list) for a given id. Then check subsequent d values against that list, with the result of either expanding the list with a new d value, or setting the d value to missing.

data have;
input (Subject    D1    D2   D3   D4   D5 ) ($);
cards;
1               a      b     c       d      e
1               a      d      .      .       .
1               f      a     b       .       .
2               g      c    h        e      d
2               e      d      j       .         .
;

data want;
  set have;
  by subject;
  length _d_list $300;
  retain _d_list;
  array d {*} d1-d5;
  if first.subject then _d_list=catx(' ',of d{*});
  else do _i=1 to dim(d) while (d{_i}^=' ');
    if findw(trim(_d_list),trim(d{_i}))=0 then _d_list=catx(' ',_d_list,d{_i});
    else d{_i}=' ';
  end;
  drop _: ;
run;

This program assumes (1) the data are sorted by subject, (2) the first record for each subject has no duplicate d-values, (3) that obeservations with less the 5 non-missing d-values have all the non-missing at the left end of the d-variables.

Make sure the length of _d_list is sufficient to hold the maximum number of space-separated distinct d values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

quickbluefish · Posted 07-27-2018 11:04 PM

Easiest would probably be to just transpose vertically and deduplicate with PROC SORT by subject and diagnosis. But if you want to keep the original structure, maybe this:

proc sort data=have; by subject; run;

proc sql noprint;
select max(n)*5 into :maxD from	
	(select subject, count(*) as n 
	from have group by subject);
quit;

data have;
set have;
by subject;
array U {&maxD} $10 _temporary_;
array d {*} D1-D5;
if first.subject then do;
	do i=1 to dim(U);
		U[i]='';
	end;
	n=0;
end;
do i=1 to dim(d);
	if not missing(d[i]) then do;
		if d[i] in U then d[i]='';
		else do;
			n+1;
			U[n]=d[i];
		end;
	end;
end;
run;

novinosrin · Posted 07-27-2018 11:04 PM

/*a little cleaner than previous*/
data want;
 if _n_ = 1 then do ;                          
    declare hash h ( ) ;*further edited;
    h.definekey ("subject",'d') ;                   
    h.definedata ("subject",'d') ; 
	h.definedone(); 
	end;
set have ;
by subject;
array t(*) d1-d5;
do n=1 to dim(t);
d=t(n);
rc=h.check();
h.replace();  *further edited;
if rc=0  then call missing(t(n));
end;
drop rc n d;
run;

ameriel · Posted 07-28-2018 05:19 PM

Thanks so much everyone for the help, I made it work clean and fast!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away