Dropping duplicates based on a condition?

publichealth11 · Posted 11-16-2022 12:22 PM

I have a data set with many duplicates for each observation. Some duplicates have the same test result, while others have different test results.

For example:

Person1. Positive

Person2. Unknown

Person2. Positive

Person3. Unknown

Person3. Missing

For Person2, I want to keep the duplicate with the positive test result over the unknown one. For Person1, I want to keep just one of the results, since they are the same. For Person3, I'd like to keep the unknown duplicate over the missing one. I've already coded the test results in priority order: positive = 1, unknown = 2, missing = 3.

How can I code it to drop certain duplicates based on the test result status?

AMSAS · Posted 11-16-2022 12:51 PM

As long as you sort the data into the order in which you want to keep the results, you can use PROC SORT with the NODUPKEY option to keep the first record for each person.

Here's an example:

data have ;
	infile cards ;
	input 
		person $
		result $ ;
	if result="Positive" then 
		sortOrder="1" ;
	else if result="Unknown" then 
		sortOrder="2" ;
	else if result="Negative" then 
		sortOrder="3" ;

	output have ;
cards ;
Person1 Negative
Person1 Positive
Person1 Unknown
Person2 Negative
Person2 Unknown
Person3 Negative
run ;


/* First sort the records into the correct order */
proc sort 
	data=have 
	out=sort1 ;
	by person sortOrder;
run ;

/* Now remove the duplicates */
proc sort nodupkey 	
	data=sort1 
	out=want ;
	by person ;
run ;
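
A variation on the second step, in case it is useful: after the first sort, a DATA step with BY-group processing can pick the first row per person explicitly instead of relying on NODUPKEY. This is only a sketch that reuses the `have` data set and `sortOrder` variable from the example above; the data set names `sorted` and `want_first` are made up for illustration.

```sas
/* Sort so that the preferred result comes first within each person */
proc sort data=have out=sorted ;
	by person sortOrder ;
run ;

/* Keep only the first row of each BY group, i.e. the lowest sortOrder */
data want_first ;
	set sorted ;
	by person ;
	if first.person ;
run ;
```

With the sample data above, this should keep the same rows as the NODUPKEY approach; the difference is that the "keep the first row" rule is spelled out in the code.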

PeterClemmensen · Posted 11-16-2022 01:09 PM

My 2 cents

data have;
input person $ result $;
datalines;
Person1 Positive 
Person1 Positive 
Person2 Unknown  
Person2 Positive 
Person3 Unknown  
Person3 Missing  
;

proc sql;
   create table want as
   select distinct * 
   from have
   group by person 
   having whichc(result, 'Positive', 'Unknown', 'Missing')
    = min(whichc(result, 'Positive', 'Unknown', 'Missing'))
   ;
quit;

Results:

person   result
Person1  Positive
Person2  Positive
Person3  Unknown
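
For anyone unfamiliar with WHICHC: it returns the position of the first argument within the list of values that follows it, or 0 if there is no match, which is why the smallest value per person identifies the preferred result. A minimal sketch to see this in the log (the variable names and the value 'Other' are made up for illustration):

```sas
data _null_;
   rank_pos   = whichc('Positive', 'Positive', 'Unknown', 'Missing');
   rank_unk   = whichc('Unknown',  'Positive', 'Unknown', 'Missing');
   rank_other = whichc('Other',    'Positive', 'Unknown', 'Missing');
   /* Writes rank_pos=1 rank_unk=2 rank_other=0 to the log */
   put rank_pos= rank_unk= rank_other=;
run;
```

Because an unlisted value ranks as 0, it would be the minimum for that person in the query above, so a row with an unlisted result would be the one kept.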


s_lassen · Posted 11-23-2022 09:44 AM

I think it may be a good idea to check for unexpected values/errors.

Given data like this:

data have;
input person $ result $;
datalines;
Person1 Positive 
Person1 Positive 
Person2 Unknown  
Person2 Positive 
Person3 Unknown  
Person3 Missing  
Person4 Gylle
;run;

(Note that I put an unexpected value, "Gylle", in the last row.)

One way to go about it could be this:

data want;
  array values(3) $8 _temporary_ ('Positive','Unknown','Missing');
  do until(last.person);
    set have;
    by person;
    _idx=min(_idx,whichc(result,of values(*)));
    if _idx=0 then 
      error 'Unexpected result value: ' result;
  end;
  if _idx>0 then
    result=values(_idx);
  else 
    delete;
  drop _idx;
run;
