Hi SAS experts,
My dataset contains duplicate observations and the second observation of the duplicate contains information I need for some variables.
I have sorted the data by 2 variables: record_id and event_name, and by using proc sort with nodupkey SAS deletes the second observation of the duplicate.
Is there a way for me to keep the second observation and delete the first of the duplicates?
Thank you in advance for any advice!!
That's what my code does:
data have;
input record_id event_name :&$12. instrument :&$12. vital performance symptoms;
datalines;
1 . consult 1 . .
1 3 months follow up . . .
1 3 months follow up 1 80 1
1 6 months follow up 1 . .
1 1 year follow up 1 80 0
;
proc sort data=have; by record_id event_name; run;
data want;
set have;
by record_id event_name;
/* To keep only the last record */
if last.event_name;
/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;
run;
proc print; run;
record_ event_ Obs id name instrument vital performance symptoms 1 1 consult 1 . . 2 1 1 year follow up 1 80 0 3 1 3 months follow up 1 80 1 4 1 6 months follow up 1 . .
Can you post input and required output samples plz
Currently my dataset has:
record_id event_name instrument vital performance symptoms ...........
1 consult 1
1 3 months follow up
1 3 months follow up 1 80 1
1 6 months follow up 1
1 1 year follow up 1 80 0
There are about 5000 observations in my dataset and a lot of record has duplicates like record 1, and I would like to keep the second observation of the duplicate as it has data values in variables I am interested.
assuming you want to keep the second of each by group i.e record_id
data want;
do _n_=1 by 1 until(last.record_id);
set have;
by record_id;
if _n_=2 then output;
end;
run;
/*or/And what if you had just one record_idin some cases*/
data want;
do _n_=1 by 1 until(last.record_id);
set have;
by record_id;
if first.record_id and last.record_id then output;
else if _n_=2 then output;
end;
run;
BY processing can be put to profit here
data want;
set have;
by record_id event_name;
/* To keep only the last record */
if last.event_name;
/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;
run;
Apologies but I should have specified that I would like to keep the second record of the 3 month in my example. So each line would be a different event_name.
I would like to have:
record_id event_name instrument vital performance symptoms ...........
1 consult 1
1 3 months follow up 1 80 1
1 6 months follow up 1
1 1 year follow up 1 80 0
That's what my code does:
data have;
input record_id event_name :&$12. instrument :&$12. vital performance symptoms;
datalines;
1 . consult 1 . .
1 3 months follow up . . .
1 3 months follow up 1 80 1
1 6 months follow up 1 . .
1 1 year follow up 1 80 0
;
proc sort data=have; by record_id event_name; run;
data want;
set have;
by record_id event_name;
/* To keep only the last record */
if last.event_name;
/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;
run;
proc print; run;
record_ event_ Obs id name instrument vital performance symptoms 1 1 consult 1 . . 2 1 1 year follow up 1 80 0 3 1 3 months follow up 1 80 1 4 1 6 months follow up 1 . .
Thank you!!! It worked! I was stuck on this for a few days and didn't want to try to delete the first duplicate manually. It's amazing how the coding is just 2 lines!
I believe you need to sort by at least one additional variable as else if the source data comes in a different order you can easily end-up picking the records with the missings - which I believe is not what the OP wants.
I see! Thank you for the wonderful advice!! I will add an additional variable in proc sort
I understand your point @Patrick, but there might not be another variable to sort by. However there should be no need for an extra sorting variable as the default sorting behavior is to keep the original ordering within by-groups (see the SORTEQUALS option.)
For the data you've posted the following should work (which is basically what @PGStats already suggested).
data have;
infile datalines truncover dsd dlm='|' ;
input record_id event_name :$20. instrument vital performance symptoms;
datalines;
1|consult|1|
1|3 months|follow up
1|3 months|follow up|1|80|1
1|6 months|follow up|1|
1|1 year|follow up|1|80|0
;
run;
proc sort data=have out=inter;
by record_id event_name vital performance symptoms;
run;
data want;
set inter;
by record_id event_name;
if last.event_name;
run;
Question is if the sample data you've posted is sufficiently representative for your real data. I.e. if the record you want to keep has VITAL as missing but there is a "duplicate" with Vital populated then above code wouldn't work.
Also if you need logic which just consolidates all non-missing data per record_id and event_name into a single record then the code logic needs also to be different.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.