DATA Step, Macro, Functions and more

finding duplicates obs (using sas code)

Accepted Solution Solved
Reply
Contributor
Posts: 39
Accepted Solution

finding duplicates obs (using sas code)

How can I find duplicates obs by one or two variables? (using sas code, not proc sql)


Accepted Solutions
Solution
‎02-03-2017 12:43 PM
Trusted Advisor
Posts: 1,381

Re: finding duplicates obs (using sas code)

[ Edited ]

If I understand correctly you want:

data single duplicate;
  set have;
    by id;
if first.id then output single; if not(first.id and last.id) then output duplicate; run;

Run it and check results.

 

Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:

OBS   ID  text

  1      A    aaa    - this obs is FIRST.ID

  2      A    bbb

  3      A    ccc    - this obs is LAST.ID

  4      B    xxx    - this obs is:  FIRST.ID and LAST.ID

  5      C 

View solution in original post


All Replies
Super User
Posts: 10,500

Re: finding duplicates obs (using sas code)

Please provide some example input data and the desired result as data step code.

 

There are a number of ways to look at "duplicates" but when you say "by one or two variables" what you mean with "by one or two variables" is not intuitively obvious, at least to me.

 

What do you want to do with the duplicates? Remove them? Mark them? Combine the records? Something else?

Contributor
Posts: 39

Re: finding duplicates obs (using sas code)

I have a dataset with duplicates by ID or can be (ID2 ID3)
I need two things:

1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1

2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1

Occasional Contributor
Posts: 16

Re: finding duplicates obs (using sas code)

I'm just a beginner here, but are you hoping to output all duplicates into another dataset, or just remove duplicates? 

 

If just to remove duplicates, then

 

proc sort data=duplicates out=deduped;
   by var1 var2;
run;

proc sort data=deduped nodupkey;
   by var1 var2;
run;
Trusted Advisor
Posts: 1,381

Re: finding duplicates obs (using sas code)

Let say you have just 3 observations:

OBS   ID  text

  1      A    aaa

  2      A    bbb

  3      A    ccc

Assuming dataset is sorted by ID (doesn't matter how many variables in ID)

then you can do:

       data dsn_out;

         set dsn_in;

               by ID;   /* or by ID1 ID2 ID3 ;*/

                        /* to save 1st obs only */   if first.ID;    /* or if first.ID3; */

                        /* to save last obs only */  if last.ID;     /* or if last.ID3; */

 

                        /* to save any ID that has more than one obs - with all its obs */

                          if not (first.ID and last.ID);

 

                       /* to save only IDs with one obs only - exclude duplicates */

                          if first.ID and last.ID;

      run;

 

  Run each possibilit separately and check results.

                       

 

 

Contributor
Posts: 39

Re: finding duplicates obs (using sas code)

thank you Shmuel, I used

if not (first.ID and last.ID);

but the result was the same original dataset, I need output all duplicates, triplicates etc into another dataset

 

 

Trusted Advisor
Posts: 1,381

Re: finding duplicates obs (using sas code)

@Angel_Saenz - you say that  using:

          if not (first.ID and last.ID);  - the result was the same original dataset.

That means that you have no instances of one observation only by any ID.

 

In case I have wrong conclusion and you want to split your dataset into two

you can do:

 

data single duplicate;

  set have;

    by id;

         if first.id and last.id then output single;

         else output duplicate;

run;

 

If it is not what you need then post:

1) test data

2) wanted result from test

 

Contributor
Posts: 39

Re: finding duplicates obs (using sas code)

hi Shmuel I have a question you told me:

 

Assuming dataset is sorted by ID (doesn't matter how many variables in ID)

then you can do:

       data dsn_out;

         set dsn_in;

               by ID;   /* or by ID1 ID2 ID3 ;*/

                        /* to save 1st obs only */   if first.ID;    /* or if first.ID3; */

 

But If I need 1 obs by ID1 and ID2 together, how can I do it?

ID1 ID2

1     A

1     A

1     B

1     B

2     A

2     A

2     B

2     B

2     C

2     D

 

result I want:

ID1 ID2

1     A

1     B

2     A

2     B

2     C

2     D

 

I try with

if (first.ID1 and first.ID2) but that just let me 1 obs, like using if first.ID1

I do it concatenating ID1 ID2 but I want to know if are other way

Trusted Advisor
Posts: 1,381

Re: finding duplicates obs (using sas code)

@Angel_Saenz - you ask for: "I need 1 obs by ID1 and ID2 together, how can I do it?"

 

Then the code should be:

data dsn_out;
   set dsn_in;
    by ID1 ID2;
         if first.ID2; 
run;
Contributor
Posts: 39

Re: finding duplicates obs (using sas code)

Thank you christinakwang I need output all duplicates, triplicates etc into another dataset

Contributor
Posts: 44

Re: finding duplicates obs (using sas code)

So you need non-duplicates into one dataset and everything else into another dataset ?
Contributor
Posts: 39

Re: finding duplicates obs (using sas code)

I have a dataset with duplicates by ID

I need two things:

1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1

2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1

Contributor
Posts: 44

Re: finding duplicates obs (using sas code)

Angel,

 

The below is a shade of what was proposed earlier. Two dataset's would be created as a outcome with one having non-duplicates and other having all duplicates.

 

data have ;
length city $20. ;
input id city $;
datalines ;
12 Jacksonville
13 Seattle
13 Seattle
14 Dallas
14 Dallas
14 Dallas
15 Tampa
15 Tampa
15 Tampa
15 Tampa
;
run ;

data nodup dupli  ;
set have ;
by id ;
if first.id =1 and last.id=1 then output nodup ;
else output dupli ;
run ;

Solution
‎02-03-2017 12:43 PM
Trusted Advisor
Posts: 1,381

Re: finding duplicates obs (using sas code)

[ Edited ]

If I understand correctly you want:

data single duplicate;
  set have;
    by id;
if first.id then output single; if not(first.id and last.id) then output duplicate; run;

Run it and check results.

 

Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:

OBS   ID  text

  1      A    aaa    - this obs is FIRST.ID

  2      A    bbb

  3      A    ccc    - this obs is LAST.ID

  4      B    xxx    - this obs is:  FIRST.ID and LAST.ID

  5      C 

Contributor
Posts: 32

Re: finding duplicates obs (using sas code)

Try:

 

Proc sort data=test dupout=duplicate nodupkey;

by var1 var2;

run;

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 317 views
  • 7 likes
  • 6 in conversation