BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
michan22
Quartz | Level 8

Hi SAS experts,

 

My dataset contains duplicate observations and the second observation of the duplicate contains information I need for some variables.

I have sorted the data by 2 variables: record_id and event_name, and by using proc sort with nodupkey SAS deletes the second observation of the duplicate.

Is there a way for me to keep the second observation and delete the first of the duplicates?

Thank you in advance for any advice!!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

That's what my code does:

 

data have;
input record_id event_name :&$12. instrument :&$12. vital performance symptoms;
datalines;
1                    .                      consult                  1  . .                     
1                    3 months               follow up    . . . 
1                    3 months               follow up            1                 80                             1
1                    6 months               follow up            1 . .                                                  
1                    1 year                 follow up            1                  80                            0
;

proc sort data=have; by record_id event_name; run;

data want;
set have;
by record_id event_name;

/* To keep only the last record */
if last.event_name;

/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;

run;

proc print; run;
           record_     event_
    Obs       id        name      instrument    vital    performance    symptoms

     1        1                   consult         1            .            .
     2        1       1 year      follow up       1           80            0
     3        1       3 months    follow up       1           80            1
     4        1       6 months    follow up       1            .            .
PG

View solution in original post

11 REPLIES 11
novinosrin
Tourmaline | Level 20

Can you post input and required output samples plz

michan22
Quartz | Level 8

Currently my dataset has:

 

record_id       event_name          instrument          vital           performance             symptoms           ...........

1                    consult                  1                       

1                    3 months               follow up 

1                    3 months               follow up            1                 80                             1

1                    6 months               follow up            1                                                  

1                    1 year                    follow up            1                  80                            0

 

There are about 5000 observations in my dataset and a lot of record has duplicates like record 1, and I would like to keep the second observation of the duplicate as it has data values in variables I am interested.

 

novinosrin
Tourmaline | Level 20

assuming you want to keep the second of each by group i.e record_id

 

data want;
do _n_=1 by 1 until(last.record_id);
set have;
by record_id;
if _n_=2 then output;
end;
run;
/*or/And what if you had just one record_idin some cases*/
data want;
do _n_=1 by 1 until(last.record_id);
set have;
by record_id;
if first.record_id and last.record_id then output;
else if _n_=2 then output;
end;
run;
PGStats
Opal | Level 21

BY processing can be put to profit here

 

data want;
set have;
by record_id event_name;

/* To keep only the last record */
if last.event_name;

/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;

run;
PG
michan22
Quartz | Level 8

Apologies but I should have specified that I would like to keep the second record of the 3 month in my example. So each line would be a different event_name.

I would like to have:

 

record_id       event_name          instrument          vital           performance             symptoms           ...........

1                    consult                  1                       

1                    3 months               follow up            1                 80                             1

1                    6 months               follow up            1                                                  

1                    1 year                    follow up            1                  80                            0

PGStats
Opal | Level 21

That's what my code does:

 

data have;
input record_id event_name :&$12. instrument :&$12. vital performance symptoms;
datalines;
1                    .                      consult                  1  . .                     
1                    3 months               follow up    . . . 
1                    3 months               follow up            1                 80                             1
1                    6 months               follow up            1 . .                                                  
1                    1 year                 follow up            1                  80                            0
;

proc sort data=have; by record_id event_name; run;

data want;
set have;
by record_id event_name;

/* To keep only the last record */
if last.event_name;

/* OR */
/* To remove only the first record when there are many */
*if not first.event_name or last.event_name;

run;

proc print; run;
           record_     event_
    Obs       id        name      instrument    vital    performance    symptoms

     1        1                   consult         1            .            .
     2        1       1 year      follow up       1           80            0
     3        1       3 months    follow up       1           80            1
     4        1       6 months    follow up       1            .            .
PG
michan22
Quartz | Level 8

Thank you!!! It worked! I was stuck on this for a few days and didn't want to try to delete the first duplicate manually. It's amazing how the coding is just 2 lines!

 

Patrick
Opal | Level 21

@PGStats

I believe you need to sort by at least one additional variable as else if the source data comes in a different order you can easily end-up picking the records with the missings - which I believe is not what the OP wants.

 

michan22
Quartz | Level 8

I see! Thank you for the wonderful advice!! I will add an additional variable in proc sort

PGStats
Opal | Level 21

I understand your point @Patrick, but there might not be another variable to sort by. However there should be no need for an extra sorting variable as the default sorting behavior is to keep the original ordering within by-groups (see the SORTEQUALS option.)

PG
Patrick
Opal | Level 21

@michan22

For the data you've posted the following should work (which is basically what @PGStats already suggested).

data have;
  infile datalines truncover dsd dlm='|' ;
  input record_id event_name :$20. instrument vital performance symptoms;
  datalines;
1|consult|1|
1|3 months|follow up 
1|3 months|follow up|1|80|1
1|6 months|follow up|1|
1|1 year|follow up|1|80|0
;
run;

proc sort data=have out=inter;
  by record_id event_name vital performance symptoms;
run;

data want;
  set inter;
  by record_id event_name;
  if last.event_name;
run;

Question is if the sample data you've posted is sufficiently representative for your real data. I.e. if the record you want to keep has VITAL as missing but there is a "duplicate" with Vital populated then above code wouldn't work. 

Also if you need logic which just consolidates all non-missing data per record_id and event_name into a single record then the code logic needs also to be different.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 7265 views
  • 1 like
  • 4 in conversation