DATA Step, Macro, Functions and more

Removing Duplicates by multiple key variables

Accepted Solution Solved
Reply
Contributor
Posts: 45
Accepted Solution

Removing Duplicates by multiple key variables

[ Edited ]

Hello Everyone!

 

I have a dataset with 6 years worth of patient healthcare claims. Patients appear multiple times within each year and also reappear in subsequent years. What I want to do is eliminate the duplicates within the year. 

 

The code I used was:

 

proc sort data=dataset1 nodupkey out=dataset2;

by patient_id;

run;

 

While this code removes the duplicate patients within each year, it also removes the patients from subsequent years, which I do not want. In other words, patient 01 appears 10 times in 2009 3 times in 2010 7 times in 2011. The code I used removed 9 observations of patient 01 in 2009 but also completely removed all observations in the subsequent years which made it seem like patient 01 didn't have any health claims in 2010 or 2011. 

 

Is there a way to remove duplicate patients within just one year without it impacting the other years? I do not want to split my dataset if I don't have to. Some additional information that might be helpful:

 

1) I use SAS 9.2

2) My years are in a variable (2009 through 2011) and are numeric

3) patient_id is a character variable

 

 

 


Accepted Solutions
Solution
‎04-15-2016 02:10 PM
Regular Contributor
Posts: 212

Re: Advance: Removing Duplicates

Hi mate,

First you need to check the sintax of your clause.

In fact, it is working correctly once you say to remove duplicates, only by an id variable.

But if what you need is to check if the entire row is duplicated you can use the following sintax:

 

proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;

Or you can only add the new column that is need.

 

Hope this helps

View solution in original post


All Replies
Solution
‎04-15-2016 02:10 PM
Regular Contributor
Posts: 212

Re: Advance: Removing Duplicates

Hi mate,

First you need to check the sintax of your clause.

In fact, it is working correctly once you say to remove duplicates, only by an id variable.

But if what you need is to check if the entire row is duplicated you can use the following sintax:

 

proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;

Or you can only add the new column that is need.

 

Hope this helps

Contributor
Posts: 45

Re: Advance: Removing Duplicates

Thanks for the help! Your idea helped me tweak my code just right to get what I needed Smiley Happy

Regular Contributor
Posts: 212

Re: Advance: Removing Duplicates

Great, there are many other ways to remove duplicates.

 

You can google it, you will see there is one tricky way but efficient. 

Is to use the first or last statement.

 

order your table and do the following.

Data want;
   set have;
   by need_variable;
   if first.need_variable then num = 1;
   else num=0;
run;

data want2;
  set want;
 where num=1;
run;

You can use as many variables as you need. just need to sort before you do this. Sort without nodupkey.

Super User
Posts: 17,885

Re: Removing Duplicates by multiple key variables

Does your dataset have only ID and year? Otherwise, I wouldn't expect that code to work. 

Regular Learner
Posts: 1

Re: Removing Duplicates by multiple key variables

Hello, another method that you could try is to concatenate the patient_id and year into a new variable as your UniqueID. For instance, patient 01 in 2010 would be 012010 and for 2011 would be 012011. Then delete duplicates based on this UniqueID.
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 333 views
  • 1 like
  • 4 in conversation