BookmarkSubscribeRSS Feed
publichealth11
Calcite | Level 5

I have a data set with many duplicates for each observation, some duplicates have the same test results while other duplicates have different test results. 

 

for example:

 

Person1.          Positive

Person1.          Positive 

Person2.          Unknown

Person2.          Positive

Person3.          Unknown

Person3.          Missing

 

In the scenario for person2, I want to keep the duplicate with a positive test result over the unknown and for person1, I want to keep just one of the results since they are the same. Person3, I'd like to keep the unknown duplicate over the missing. I've already ordered the test results for positive =1 unk=2, missing =3. 

How can I code it to drop certain duplicates based on the test result status?

3 REPLIES 3
AMSAS
SAS Super FREQ

As long as you sort the data into the order that you want to keep the results, you can use a PROC SORT NODUPKEY to get the results

Here's an example

 

data have ;
	infile cards ;
	input 
		person $
		result $ ;
	if result="Positive" then 
		sortOrder="1" ;
	else if result="Unknown" then 
		sortOrder="2" ;
	else if result="Negative" then 
		sortOrder="3" ;

	output have ;
cards ;
Person1 Negative
Person1 Positive
Person1 Unknown
Person2 Negative
Person2 Unknown
Person3 Negative
run ;


/* First sort the records into the correct order */
proc sort 
	data=have 
	out=sort1 ;
	by person sortOrder;
run ;

/* Now remove the duplicates */
proc sort nodupkey 	
	data=sort1 
	out=want ;
	by person ;
run ;
	
PeterClemmensen
Tourmaline | Level 20

My 2 cents

 

data have;
input person $ result $;
datalines;
Person1 Positive 
Person1 Positive 
Person2 Unknown  
Person2 Positive 
Person3 Unknown  
Person3 Missing  
;

proc sql;
   create table want as
   select distinct * 
   from have
   group by person 
   having whichc(result, 'Positive', 'Unknown', 'Missing')
    = min(whichc(result, 'Positive', 'Unknown', 'Missing'))
   ;
quit;

 

Results:

 

person   result
Person1  Positive
Person2  Positive
Person3  Unknown

 

s_lassen
Meteorite | Level 14

I think it may be a good idea to check for unexpected values/errors.

 

Given data like this

data have;
input person $ result $;
datalines;
Person1 Positive 
Person1 Positive 
Person2 Unknown  
Person2 Positive 
Person3 Unknown  
Person3 Missing  
Person4 Gylle
;run;

(note that I put in a not predicted value "Gylle" in the last row)

 

One way to go about it could be this:

data want;
  array values(3) $8 _temporary_ ('Positive','Unknown','Missing');
  do until(last.person);
    set have;
    by person;
    _idx=min(_idx,whichc(result,of values(*)));
	if _idx=0 then 
	  error 'Unexpected result value: ' result;
	end;
  if _idx>0 then
    result=values(_idx);
  else 
    delete;
  drop _idx;
run;

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 877 views
  • 0 likes
  • 4 in conversation