Solved: Re: finding duplicates obs (using sas code)

Angel_Saenz · Posted 02-01-2017 01:44 PM

How can I find duplicates obs by one or two variables? (using sas code, not proc sql)

Shmuel · Posted 02-02-2017 11:46 PM

If I understand correctly you want:

data single duplicate;
  set have;
    by id;
         if first.id then output single; 
         if not(first.id and last.id) then output duplicate;
run;

Run it and check results.

Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:

OBS ID text

1 A aaa - this obs is FIRST.ID

2 A bbb

3 A ccc - this obs is LAST.ID

4 B xxx - this obs is: FIRST.ID and LAST.ID

5 C

View solution in original post

ballardw · Posted 02-01-2017 01:51 PM

Please provide some example input data and the desired result as data step code.

There are a number of ways to look at "duplicates" but when you say "by one or two variables" what you mean with "by one or two variables" is not intuitively obvious, at least to me.

What do you want to do with the duplicates? Remove them? Mark them? Combine the records? Something else?

Angel_Saenz · Posted 02-02-2017 05:33 PM

I have a dataset with duplicates by ID or can be (ID2 ID3)
I need two things:

1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1

2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1

christinakwang · Posted 02-01-2017 01:55 PM

I'm just a beginner here, but are you hoping to output all duplicates into another dataset, or just remove duplicates?

If just to remove duplicates, then

proc sort data=duplicates out=deduped;
   by var1 var2;
run;

proc sort data=deduped nodupkey;
   by var1 var2;
run;

Shmuel · Posted 02-01-2017 02:38 PM

Let say you have just 3 observations:

OBS ID text

1 A aaa

2 A bbb

3 A ccc

Assuming dataset is sorted by ID (doesn't matter how many variables in ID)

then you can do:

data dsn_out;

set dsn_in;

by ID; /* or by ID1 ID2 ID3 ;*/

/* to save 1st obs only */ if first.ID; /* or if first.ID3; */

/* to save last obs only */ if last.ID; /* or if last.ID3; */

/* to save any ID that has more than one obs - with all its obs */

if not (first.ID and last.ID);

/* to save only IDs with one obs only - exclude duplicates */

if first.ID and last.ID;

run;

Run each possibilit separately and check results.

Angel_Saenz · Posted 02-02-2017 02:51 PM

thank you Shmuel, I used

if not (first.ID and last.ID);

but the result was the same original dataset, I need output all duplicates, triplicates etc into another dataset

Shmuel · Posted 02-02-2017 03:08 PM

@Angel_Saenz - you say that using:

if not (first.ID and last.ID); - the result was the same original dataset.

That means that you have no instances of one observation only by any ID.

In case I have wrong conclusion and you want to split your dataset into two

you can do:

data single duplicate;

set have;

by id;

if first.id and last.id then output single;

else output duplicate;

run;

If it is not what you need then post:

1) test data

2) wanted result from test

Angel_Saenz · Posted 02-14-2017 02:26 PM

hi Shmuel I have a question you told me:

Assuming dataset is sorted by ID (doesn't matter how many variables in ID)

then you can do:

data dsn_out;

set dsn_in;

by ID; /* or by ID1 ID2 ID3 ;*/

/* to save 1st obs only */ if first.ID; /* or if first.ID3; */

But If I need 1 obs by ID1 and ID2 together, how can I do it?

ID1 ID2

1 A

1 B

2 A

2 B

2 C

2 D

result I want:

ID1 ID2

1 A

1 B

2 A

2 B

2 C

2 D

I try with

if (first.ID1 and first.ID2) but that just let me 1 obs, like using if first.ID1

I do it concatenating ID1 ID2 but I want to know if are other way

Shmuel · Posted 02-14-2017 03:59 PM

@Angel_Saenz - you ask for: "I need 1 obs by ID1 and ID2 together, how can I do it?"

Then the code should be:

data dsn_out;
   set dsn_in;
    by ID1 ID2;
         if first.ID2; 
run;

Angel_Saenz · Posted 02-02-2017 02:47 PM

Thank you christinakwang I need output all duplicates, triplicates etc into another dataset

anoopmohandas7 · Posted 02-02-2017 03:42 PM

So you need non-duplicates into one dataset and everything else into another dataset ?

Angel_Saenz · Posted 02-02-2017 05:25 PM

I have a dataset with duplicates by ID

I need two things:

1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1

2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1

anoopmohandas7 · Posted 02-02-2017 09:49 PM

Angel,

The below is a shade of what was proposed earlier. Two dataset's would be created as a outcome with one having non-duplicates and other having all duplicates.

data have ;
length city $20. ;
input id city $;
datalines ;
12 Jacksonville
13 Seattle
13 Seattle
14 Dallas
14 Dallas
14 Dallas
15 Tampa
15 Tampa
15 Tampa
15 Tampa
;
run ;

data nodup dupli  ;
set have ;
by id ;
if first.id =1 and last.id=1 then output nodup ;
else output dupli ;
run ;

Shmuel · Posted 02-02-2017 11:46 PM

If I understand correctly you want:

data single duplicate;
  set have;
    by id;
         if first.id then output single; 
         if not(first.id and last.id) then output duplicate;
run;

Run it and check results.

Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:

OBS ID text

1 A aaa - this obs is FIRST.ID

2 A bbb

3 A ccc - this obs is LAST.ID

4 B xxx - this obs is: FIRST.ID and LAST.ID

5 C

fbl204653 · Posted 02-02-2017 03:15 PM

Try:

Proc sort data=test dupout=duplicate nodupkey;

by var1 var2;

run;

Registration is open

SAS Training: Just a Click Away