get rid of duplicates randomly

Accepted Solution Solved
Reply
Regular Contributor
Posts: 161
Accepted Solution

get rid of duplicates randomly

I merged two datasets.  Unfortunately, the matching criteria is one-to-one because of inherent data limitation.  Traditionally, I just use proc sort nodupkey option to get rid of the duplicates. 

But in this case, I hope to get rid of the duplicates randomly.  Suppose that there are four variables: patient_id, doctor_id, date, blood_pressure.  The first three variables should uniquely identify each row, but that is not the case for now because duplicates.  Now, I want to get rid of the duplicates in a way that the blood_pressure is selected randomly for given patient_id, doctor_id, and date. 


How would I best do that?


Accepted Solutions
Solution
‎05-17-2014 10:01 AM
Respected Advisor
Posts: 4,641

Re: get rid of duplicates randomly


All Replies
Super User
Posts: 17,750

Re: get rid of duplicates randomly

I'll add this may not be a good idea and you may want to consider why a person has multiple. If you can identify the time of the BP reading, then you should decide which to use.

For example, if a person came into the ER they may take BP both before admission, when doctor sees you (several hours later here in free medicare land), and before you leave.

Or if the first reading seems wonky take a second reading.

1. Random selection using proc surveyselect by the first three variables.

2. Create a random variable across the multiples, sort by the random variable and then take the top occurrence.

data random1;

set have;

by patient_id doctor_id date;

random_number=rand('normal');

run;

proc sort data=random1; by patient_id doctor_id date random_number;

run;

data unique;

set random1;

by patient_id doctor_id date;

if first.date;

run;

Regular Contributor
Posts: 161

Re: get rid of duplicates randomly

Hi, Reeza: 

I'm trying out your great idea! 

Just to be sure I don't want to lose any unique patent_id, doctor_id, and date observation.  The SAS code you included is for the method 2 only, and your method 1 and method 2 are independent of each other.  Am I right?  If so, what would the proc surveyselect method looks like, just out of curiosity.

I provided a common setting so that folks here shall have easier time to understand the structure of the data rather than scratching their heads over the technical jargon.  But your point is well taken, normally I don't kill duplicate this way.  But this time around, the data limitation is inherent, and there is not that much I can do about it.

Respected Advisor
Posts: 4,641

Re: get rid of duplicates randomly

's  Method 1 would look like this :

data test;

do id1 = 1 to 3;

     do id2 = 10 to 30 by 10;

          do no = 1 to 1 + round(ranuni(53634)); /* 1 or 2 records */

               output;

               end;

          end;

     end;

run;

title "Dataset with duplicates";

proc print data=test noobs; run;

proc surveyselect data=test method=urs sampsize=1 out=samp seed=86564;

strata id1 id2;

run;

title "Dataset with randomly selected unique keys";

proc print data=samp noobs; run;

PG

PG
Regular Contributor
Posts: 161

Re: get rid of duplicates randomly

Hi, PG

Just want to make sure , to implement your code to my situation, I need to make the following changes, right?

proc surveyselect data=have method=urs sampsize=1 out=want seed=86564;

     strata patient_id doctor_id date;

run;

Solution
‎05-17-2014 10:01 AM
Respected Advisor
Posts: 4,641

Re: get rid of duplicates randomly

Right.

PG
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 297 views
  • 6 likes
  • 3 in conversation