Hello Everyone!
I have a dataset with 6 years worth of patient healthcare claims. Patients appear multiple times within each year and also reappear in subsequent years. What I want to do is eliminate the duplicates within the year.
The code I used was:
proc sort data=dataset1 nodupkey out=dataset2;
by patient_id;
run;
While this code removes the duplicate patients within each year, it also removes the patients from subsequent years, which I do not want. In other words, patient 01 appears 10 times in 2009 3 times in 2010 7 times in 2011. The code I used removed 9 observations of patient 01 in 2009 but also completely removed all observations in the subsequent years which made it seem like patient 01 didn't have any health claims in 2010 or 2011.
Is there a way to remove duplicate patients within just one year without it impacting the other years? I do not want to split my dataset if I don't have to. Some additional information that might be helpful:
1) I use SAS 9.2
2) My years are in a variable (2009 through 2011) and are numeric
3) patient_id is a character variable
Hi mate,
First you need to check the sintax of your clause.
In fact, it is working correctly once you say to remove duplicates, only by an id variable.
But if what you need is to check if the entire row is duplicated you can use the following sintax:
proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;
Or you can only add the new column that is need.
Hope this helps
Hi mate,
First you need to check the sintax of your clause.
In fact, it is working correctly once you say to remove duplicates, only by an id variable.
But if what you need is to check if the entire row is duplicated you can use the following sintax:
proc sort data=dataset1 nodupkey out=dataset2;
by _all_;
run;
Or you can only add the new column that is need.
Hope this helps
Thanks for the help! Your idea helped me tweak my code just right to get what I needed 🙂
Great, there are many other ways to remove duplicates.
You can google it, you will see there is one tricky way but efficient.
Is to use the first or last statement.
order your table and do the following.
Data want;
set have;
by need_variable;
if first.need_variable then num = 1;
else num=0;
run;
data want2;
set want;
where num=1;
run;
You can use as many variables as you need. just need to sort before you do this. Sort without nodupkey.
Does your dataset have only ID and year? Otherwise, I wouldn't expect that code to work.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.