Solved: Removing Duplicates by multiple key variables

TXSASneophyte · Posted 04-15-2016 01:30 PM

Hello Everyone!

I have a dataset with 6 years worth of patient healthcare claims. Patients appear multiple times within each year and also reappear in subsequent years. What I want to do is eliminate the duplicates within the year.

The code I used was:

proc sort data=dataset1 nodupkey out=dataset2;

by patient_id;

run;

While this code removes the duplicate patients within each year, it also removes the patients from subsequent years, which I do not want. In other words, patient 01 appears 10 times in 2009 3 times in 2010 7 times in 2011. The code I used removed 9 observations of patient 01 in 2009 but also completely removed all observations in the subsequent years which made it seem like patient 01 didn't have any health claims in 2010 or 2011.

Is there a way to remove duplicate patients within just one year without it impacting the other years? I do not want to split my dataset if I don't have to. Some additional information that might be helpful:

1) I use SAS 9.2

2) My years are in a variable (2009 through 2011) and are numeric

3) patient_id is a character variable

DartRodrigo · Posted 04-15-2016 01:51 PM

Hi mate,

First you need to check the sintax of your clause.

In fact, it is working correctly once you say to remove duplicates, only by an id variable.

But if what you need is to check if the entire row is duplicated you can use the following sintax:

proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;

Or you can only add the new column that is need.

Hope this helps

View solution in original post

DartRodrigo · Posted 04-15-2016 01:51 PM

Hi mate,

First you need to check the sintax of your clause.

In fact, it is working correctly once you say to remove duplicates, only by an id variable.

But if what you need is to check if the entire row is duplicated you can use the following sintax:

proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;

Or you can only add the new column that is need.

Hope this helps

TXSASneophyte · Posted 04-15-2016 02:11 PM

Thanks for the help! Your idea helped me tweak my code just right to get what I needed 🙂

DartRodrigo · Posted 04-15-2016 02:20 PM

Great, there are many other ways to remove duplicates.

You can google it, you will see there is one tricky way but efficient.

Is to use the first or last statement.

order your table and do the following.

Data want;
   set have;
   by need_variable;
   if first.need_variable then num = 1;
   else num=0;
run;

data want2;
  set want;
 where num=1;
run;

You can use as many variables as you need. just need to sort before you do this. Sort without nodupkey.

Reeza · Posted 04-15-2016 02:26 PM

Does your dataset have only ID and year? Otherwise, I wouldn't expect that code to work.

bcolip · Posted 12-03-2016 12:22 PM

Hello, another method that you could try is to concatenate the patient_id and year into a new variable as your UniqueID. For instance, patient 01 in 2010 would be 012010 and for 2011 would be 012011. Then delete duplicates based on this UniqueID.

Removing Duplicates by multiple key variables

Re: Advance: Removing Duplicates

Re: Advance: Removing Duplicates

Re: Advance: Removing Duplicates

Re: Advance: Removing Duplicates

Re: Removing Duplicates by multiple key variables

Re: Removing Duplicates by multiple key variables

Catch up on SAS Innovate 2026

Removing Duplicates by multiple key variables

Re: Advance: Removing Duplicates

Re: Advance: Removing Duplicates

Re: Advance: Removing Duplicates

Re: Advance: Removing Duplicates

Re: Removing Duplicates by multiple key variables

Re: Removing Duplicates by multiple key variables

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away