How can I find duplicates obs by one or two variables? (using sas code, not proc sql)
If I understand correctly you want:
data single duplicate;
set have;
by id;
if first.id then output single;
if not(first.id and last.id) then output duplicate;
run;
Run it and check results.
Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:
OBS ID text
1 A aaa - this obs is FIRST.ID
2 A bbb
3 A ccc - this obs is LAST.ID
4 B xxx - this obs is: FIRST.ID and LAST.ID
5 C
Please provide some example input data and the desired result as data step code.
There are a number of ways to look at "duplicates" but when you say "by one or two variables" what you mean with "by one or two variables" is not intuitively obvious, at least to me.
What do you want to do with the duplicates? Remove them? Mark them? Combine the records? Something else?
I have a dataset with duplicates by ID or can be (ID2 ID3)
I need two things:
1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1
2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1
I'm just a beginner here, but are you hoping to output all duplicates into another dataset, or just remove duplicates?
If just to remove duplicates, then
proc sort data=duplicates out=deduped; by var1 var2; run; proc sort data=deduped nodupkey; by var1 var2; run;
Let say you have just 3 observations:
OBS ID text
1 A aaa
2 A bbb
3 A ccc
Assuming dataset is sorted by ID (doesn't matter how many variables in ID)
then you can do:
data dsn_out;
set dsn_in;
by ID; /* or by ID1 ID2 ID3 ;*/
/* to save 1st obs only */ if first.ID; /* or if first.ID3; */
/* to save last obs only */ if last.ID; /* or if last.ID3; */
/* to save any ID that has more than one obs - with all its obs */
if not (first.ID and last.ID);
/* to save only IDs with one obs only - exclude duplicates */
if first.ID and last.ID;
run;
Run each possibilit separately and check results.
thank you Shmuel, I used
if not (first.ID and last.ID);
but the result was the same original dataset, I need output all duplicates, triplicates etc into another dataset
@Angel_Saenz - you say that using:
if not (first.ID and last.ID); - the result was the same original dataset.
That means that you have no instances of one observation only by any ID.
In case I have wrong conclusion and you want to split your dataset into two
you can do:
data single duplicate;
set have;
by id;
if first.id and last.id then output single;
else output duplicate;
run;
If it is not what you need then post:
1) test data
2) wanted result from test
hi Shmuel I have a question you told me:
Assuming dataset is sorted by ID (doesn't matter how many variables in ID)
then you can do:
data dsn_out;
set dsn_in;
by ID; /* or by ID1 ID2 ID3 ;*/
/* to save 1st obs only */ if first.ID; /* or if first.ID3; */
But If I need 1 obs by ID1 and ID2 together, how can I do it?
ID1 ID2
1 A
1 A
1 B
1 B
2 A
2 A
2 B
2 B
2 C
2 D
result I want:
ID1 ID2
1 A
1 B
2 A
2 B
2 C
2 D
I try with
if (first.ID1 and first.ID2) but that just let me 1 obs, like using if first.ID1
I do it concatenating ID1 ID2 but I want to know if are other way
@Angel_Saenz - you ask for: "I need 1 obs by ID1 and ID2 together, how can I do it?"
Then the code should be:
data dsn_out;
set dsn_in;
by ID1 ID2;
if first.ID2;
run;
Thank you christinakwang I need output all duplicates, triplicates etc into another dataset
I have a dataset with duplicates by ID
I need two things:
1. one dataset with non-duplicates, for example if 3 obs has ID=1, so the result I want is one dataset just with one obs with ID=1
2. other dataset with all duplicates, for example if 3 obs has ID=1, the result is one dataset with 3 obs with ID=1
Angel,
The below is a shade of what was proposed earlier. Two dataset's would be created as a outcome with one having non-duplicates and other having all duplicates.
data have ;
length city $20. ;
input id city $;
datalines ;
12 Jacksonville
13 Seattle
13 Seattle
14 Dallas
14 Dallas
14 Dallas
15 Tampa
15 Tampa
15 Tampa
15 Tampa
;
run ;
data nodup dupli ;
set have ;
by id ;
if first.id =1 and last.id=1 then output nodup ;
else output dupli ;
run ;
If I understand correctly you want:
data single duplicate;
set have;
by id;
if first.id then output single;
if not(first.id and last.id) then output duplicate;
run;
Run it and check results.
Pay attention - assume ID='A' has 3 observations and ID='B' is single, then:
OBS ID text
1 A aaa - this obs is FIRST.ID
2 A bbb
3 A ccc - this obs is LAST.ID
4 B xxx - this obs is: FIRST.ID and LAST.ID
5 C
Try:
Proc sort data=test dupout=duplicate nodupkey;
by var1 var2;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.