BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TXSASneophyte
Obsidian | Level 7

Hello Everyone!

 

I have a dataset with 6 years worth of patient healthcare claims. Patients appear multiple times within each year and also reappear in subsequent years. What I want to do is eliminate the duplicates within the year. 

 

The code I used was:

 

proc sort data=dataset1 nodupkey out=dataset2;

by patient_id;

run;

 

While this code removes the duplicate patients within each year, it also removes the patients from subsequent years, which I do not want. In other words, patient 01 appears 10 times in 2009 3 times in 2010 7 times in 2011. The code I used removed 9 observations of patient 01 in 2009 but also completely removed all observations in the subsequent years which made it seem like patient 01 didn't have any health claims in 2010 or 2011. 

 

Is there a way to remove duplicate patients within just one year without it impacting the other years? I do not want to split my dataset if I don't have to. Some additional information that might be helpful:

 

1) I use SAS 9.2

2) My years are in a variable (2009 through 2011) and are numeric

3) patient_id is a character variable

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
DartRodrigo
Lapis Lazuli | Level 10

Hi mate,

First you need to check the sintax of your clause.

In fact, it is working correctly once you say to remove duplicates, only by an id variable.

But if what you need is to check if the entire row is duplicated you can use the following sintax:

 

proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;

Or you can only add the new column that is need.

 

Hope this helps

View solution in original post

5 REPLIES 5
DartRodrigo
Lapis Lazuli | Level 10

Hi mate,

First you need to check the sintax of your clause.

In fact, it is working correctly once you say to remove duplicates, only by an id variable.

But if what you need is to check if the entire row is duplicated you can use the following sintax:

 

proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;

Or you can only add the new column that is need.

 

Hope this helps

TXSASneophyte
Obsidian | Level 7

Thanks for the help! Your idea helped me tweak my code just right to get what I needed 🙂

DartRodrigo
Lapis Lazuli | Level 10

Great, there are many other ways to remove duplicates.

 

You can google it, you will see there is one tricky way but efficient. 

Is to use the first or last statement.

 

order your table and do the following.

Data want;
   set have;
   by need_variable;
   if first.need_variable then num = 1;
   else num=0;
run;

data want2;
  set want;
 where num=1;
run;

You can use as many variables as you need. just need to sort before you do this. Sort without nodupkey.

Reeza
Super User

Does your dataset have only ID and year? Otherwise, I wouldn't expect that code to work. 

bcolip
Calcite | Level 5
Hello, another method that you could try is to concatenate the patient_id and year into a new variable as your UniqueID. For instance, patient 01 in 2010 would be 012010 and for 2011 would be 012011. Then delete duplicates based on this UniqueID.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 5083 views
  • 1 like
  • 4 in conversation