I have a data set with many duplicates for each observation, some duplicates have the same test results while other duplicates have different test results.
for example:
Person1. Positive
Person1. Positive
Person2. Unknown
Person2. Positive
Person3. Unknown
Person3. Missing
In the scenario for person2, I want to keep the duplicate with a positive test result over the unknown and for person1, I want to keep just one of the results since they are the same. Person3, I'd like to keep the unknown duplicate over the missing. I've already ordered the test results for positive =1 unk=2, missing =3.
How can I code it to drop certain duplicates based on the test result status?
As long as you sort the data into the order that you want to keep the results, you can use a PROC SORT NODUPKEY to get the results
Here's an example
data have ;
infile cards ;
input
person $
result $ ;
if result="Positive" then
sortOrder="1" ;
else if result="Unknown" then
sortOrder="2" ;
else if result="Negative" then
sortOrder="3" ;
output have ;
cards ;
Person1 Negative
Person1 Positive
Person1 Unknown
Person2 Negative
Person2 Unknown
Person3 Negative
run ;
/* First sort the records into the correct order */
proc sort
data=have
out=sort1 ;
by person sortOrder;
run ;
/* Now remove the duplicates */
proc sort nodupkey
data=sort1
out=want ;
by person ;
run ;
My 2 cents
data have;
input person $ result $;
datalines;
Person1 Positive
Person1 Positive
Person2 Unknown
Person2 Positive
Person3 Unknown
Person3 Missing
;
proc sql;
create table want as
select distinct *
from have
group by person
having whichc(result, 'Positive', 'Unknown', 'Missing')
= min(whichc(result, 'Positive', 'Unknown', 'Missing'))
;
quit;
Results:
person result Person1 Positive Person2 Positive Person3 Unknown
I think it may be a good idea to check for unexpected values/errors.
Given data like this
data have;
input person $ result $;
datalines;
Person1 Positive
Person1 Positive
Person2 Unknown
Person2 Positive
Person3 Unknown
Person3 Missing
Person4 Gylle
;run;
(note that I put in a not predicted value "Gylle" in the last row)
One way to go about it could be this:
data want;
array values(3) $8 _temporary_ ('Positive','Unknown','Missing');
do until(last.person);
set have;
by person;
_idx=min(_idx,whichc(result,of values(*)));
if _idx=0 then
error 'Unexpected result value: ' result;
end;
if _idx>0 then
result=values(_idx);
else
delete;
drop _idx;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.