BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Angel_Saenz
Quartz | Level 8

How can I find duplicates obs by one or two variables? (using sas code, not proc sql)

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

If I understand correctly you want:

data single duplicate;
  set have;
    by id;
if first.id then output single; if not(first.id and last.id) then output duplicate; run;

Run it and check results.

 

Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:

OBS   ID  text

  1      A    aaa    - this obs is FIRST.ID

  2      A    bbb

  3      A    ccc    - this obs is LAST.ID

  4      B    xxx    - this obs is:  FIRST.ID and LAST.ID

  5      C 

View solution in original post

14 REPLIES 14
ballardw
Super User

Please provide some example input data and the desired result as data step code.

 

There are a number of ways to look at "duplicates" but when you say "by one or two variables" what you mean with "by one or two variables" is not intuitively obvious, at least to me.

 

What do you want to do with the duplicates? Remove them? Mark them? Combine the records? Something else?

Angel_Saenz
Quartz | Level 8

I have a dataset with duplicates by ID or can be (ID2 ID3)
I need two things:

1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1

2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1

christinakwang
Obsidian | Level 7

I'm just a beginner here, but are you hoping to output all duplicates into another dataset, or just remove duplicates? 

 

If just to remove duplicates, then

 

proc sort data=duplicates out=deduped;
   by var1 var2;
run;

proc sort data=deduped nodupkey;
   by var1 var2;
run;
Shmuel
Garnet | Level 18

Let say you have just 3 observations:

OBS   ID  text

  1      A    aaa

  2      A    bbb

  3      A    ccc

Assuming dataset is sorted by ID (doesn't matter how many variables in ID)

then you can do:

       data dsn_out;

         set dsn_in;

               by ID;   /* or by ID1 ID2 ID3 ;*/

                        /* to save 1st obs only */   if first.ID;    /* or if first.ID3; */

                        /* to save last obs only */  if last.ID;     /* or if last.ID3; */

 

                        /* to save any ID that has more than one obs - with all its obs */

                          if not (first.ID and last.ID);

 

                       /* to save only IDs with one obs only - exclude duplicates */

                          if first.ID and last.ID;

      run;

 

  Run each possibilit separately and check results.

                       

 

 

Angel_Saenz
Quartz | Level 8

thank you Shmuel, I used

if not (first.ID and last.ID);

but the result was the same original dataset, I need output all duplicates, triplicates etc into another dataset

 

 

Shmuel
Garnet | Level 18

@Angel_Saenz - you say that  using:

          if not (first.ID and last.ID);  - the result was the same original dataset.

That means that you have no instances of one observation only by any ID.

 

In case I have wrong conclusion and you want to split your dataset into two

you can do:

 

data single duplicate;

  set have;

    by id;

         if first.id and last.id then output single;

         else output duplicate;

run;

 

If it is not what you need then post:

1) test data

2) wanted result from test

 

Angel_Saenz
Quartz | Level 8

hi Shmuel I have a question you told me:

 

Assuming dataset is sorted by ID (doesn't matter how many variables in ID)

then you can do:

       data dsn_out;

         set dsn_in;

               by ID;   /* or by ID1 ID2 ID3 ;*/

                        /* to save 1st obs only */   if first.ID;    /* or if first.ID3; */

 

But If I need 1 obs by ID1 and ID2 together, how can I do it?

ID1 ID2

1     A

1     A

1     B

1     B

2     A

2     A

2     B

2     B

2     C

2     D

 

result I want:

ID1 ID2

1     A

1     B

2     A

2     B

2     C

2     D

 

I try with

if (first.ID1 and first.ID2) but that just let me 1 obs, like using if first.ID1

I do it concatenating ID1 ID2 but I want to know if are other way

Shmuel
Garnet | Level 18

@Angel_Saenz - you ask for: "I need 1 obs by ID1 and ID2 together, how can I do it?"

 

Then the code should be:

data dsn_out;
   set dsn_in;
    by ID1 ID2;
         if first.ID2; 
run;
Angel_Saenz
Quartz | Level 8

Thank you christinakwang I need output all duplicates, triplicates etc into another dataset

anoopmohandas7
Quartz | Level 8
So you need non-duplicates into one dataset and everything else into another dataset ?
Angel_Saenz
Quartz | Level 8

I have a dataset with duplicates by ID

I need two things:

1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1

2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1

anoopmohandas7
Quartz | Level 8

Angel,

 

The below is a shade of what was proposed earlier. Two dataset's would be created as a outcome with one having non-duplicates and other having all duplicates.

 

data have ;
length city $20. ;
input id city $;
datalines ;
12 Jacksonville
13 Seattle
13 Seattle
14 Dallas
14 Dallas
14 Dallas
15 Tampa
15 Tampa
15 Tampa
15 Tampa
;
run ;

data nodup dupli  ;
set have ;
by id ;
if first.id =1 and last.id=1 then output nodup ;
else output dupli ;
run ;

Shmuel
Garnet | Level 18

If I understand correctly you want:

data single duplicate;
  set have;
    by id;
if first.id then output single; if not(first.id and last.id) then output duplicate; run;

Run it and check results.

 

Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:

OBS   ID  text

  1      A    aaa    - this obs is FIRST.ID

  2      A    bbb

  3      A    ccc    - this obs is LAST.ID

  4      B    xxx    - this obs is:  FIRST.ID and LAST.ID

  5      C 

fbl204653
Obsidian | Level 7

Try:

 

Proc sort data=test dupout=duplicate nodupkey;

by var1 var2;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 44478 views
  • 9 likes
  • 6 in conversation