Solved: Re: Removing the first observation of duplicates

michan22 · Posted 10-12-2018 03:50 PM

Hi SAS experts,

My dataset contains duplicate observations and the second observation of the duplicate contains information I need for some variables.

I have sorted the data by 2 variables: record_id and event_name, and by using proc sort with nodupkey SAS deletes the second observation of the duplicate.

Is there a way for me to keep the second observation and delete the first of the duplicates?

Thank you in advance for any advice!!

PGStats · Posted 10-13-2018 07:54 PM

That's what my code does:

data have;
input record_id event_name :&$12. instrument :&$12. vital performance symptoms;
datalines;
1                    .                      consult                  1  . .                     
1                    3 months               follow up    . . . 
1                    3 months               follow up            1                 80                             1
1                    6 months               follow up            1 . .                                                  
1                    1 year                 follow up            1                  80                            0
;

proc sort data=have; by record_id event_name; run;

data want;
set have;
by record_id event_name;

/* To keep only the last record */
if last.event_name;

/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;

run;

proc print; run;

           record_     event_
    Obs       id        name      instrument    vital    performance    symptoms

     1        1                   consult         1            .            .
     2        1       1 year      follow up       1           80            0
     3        1       3 months    follow up       1           80            1
     4        1       6 months    follow up       1            .            .

PG

View solution in original post

novinosrin · Posted 10-12-2018 03:51 PM

Can you post input and required output samples plz

michan22 · Posted 10-12-2018 03:59 PM

Currently my dataset has:

record_id event_name instrument vital performance symptoms ...........

1 consult 1

1 3 months follow up

1 3 months follow up 1 80 1

1 6 months follow up 1

1 1 year follow up 1 80 0

There are about 5000 observations in my dataset and a lot of record has duplicates like record 1, and I would like to keep the second observation of the duplicate as it has data values in variables I am interested.

novinosrin · Posted 10-12-2018 04:02 PM

assuming you want to keep the second of each by group i.e record_id

data want;
do _n_=1 by 1 until(last.record_id);
set have;
by record_id;
if _n_=2 then output;
end;
run;
/*or/And what if you had just one record_idin some cases*/
data want;
do _n_=1 by 1 until(last.record_id);
set have;
by record_id;
if first.record_id and last.record_id then output;
else if _n_=2 then output;
end;
run;

PGStats · Posted 10-13-2018 12:34 AM

BY processing can be put to profit here

data want;
set have;
by record_id event_name;

/* To keep only the last record */
if last.event_name;

/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;

run;

PG

michan22 · Posted 10-13-2018 07:44 PM

Apologies but I should have specified that I would like to keep the second record of the 3 month in my example. So each line would be a different event_name.

I would like to have:

record_id event_name instrument vital performance symptoms ...........

1 consult 1

1 3 months follow up 1 80 1

1 6 months follow up 1

1 1 year follow up 1 80 0

PGStats · Posted 10-13-2018 07:54 PM

That's what my code does:

data have;
input record_id event_name :&$12. instrument :&$12. vital performance symptoms;
datalines;
1                    .                      consult                  1  . .                     
1                    3 months               follow up    . . . 
1                    3 months               follow up            1                 80                             1
1                    6 months               follow up            1 . .                                                  
1                    1 year                 follow up            1                  80                            0
;

proc sort data=have; by record_id event_name; run;

data want;
set have;
by record_id event_name;

/* To keep only the last record */
if last.event_name;

/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;

run;

proc print; run;

           record_     event_
    Obs       id        name      instrument    vital    performance    symptoms

     1        1                   consult         1            .            .
     2        1       1 year      follow up       1           80            0
     3        1       3 months    follow up       1           80            1
     4        1       6 months    follow up       1            .            .

PG

michan22 · Posted 10-13-2018 08:20 PM

Thank you!!! It worked! I was stuck on this for a few days and didn't want to try to delete the first duplicate manually. It's amazing how the coding is just 2 lines!

Patrick · Posted 10-13-2018 08:28 PM

@PGStats

I believe you need to sort by at least one additional variable as else if the source data comes in a different order you can easily end-up picking the records with the missings - which I believe is not what the OP wants.

michan22 · Posted 10-13-2018 08:49 PM

I see! Thank you for the wonderful advice!! I will add an additional variable in proc sort

PGStats · Posted 10-13-2018 10:48 PM

I understand your point @Patrick, but there might not be another variable to sort by. However there should be no need for an extra sorting variable as the default sorting behavior is to keep the original ordering within by-groups (see the SORTEQUALS option.)

PG

Patrick · Posted 10-13-2018 08:07 PM

@michan22

For the data you've posted the following should work (which is basically what @PGStats already suggested).

data have;
  infile datalines truncover dsd dlm='|' ;
  input record_id event_name :$20. instrument vital performance symptoms;
  datalines;
1|consult|1|
1|3 months|follow up 
1|3 months|follow up|1|80|1
1|6 months|follow up|1|
1|1 year|follow up|1|80|0
;
run;

proc sort data=have out=inter;
  by record_id event_name vital performance symptoms;
run;

data want;
  set inter;
  by record_id event_name;
  if last.event_name;
run;

Question is if the sample data you've posted is sufficiently representative for your real data. I.e. if the record you want to keep has VITAL as missing but there is a "duplicate" with Vital populated then above code wouldn't work.

Also if you need logic which just consolidates all non-missing data per record_id and event_name into a single record then the code logic needs also to be different.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away