- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.
The left is what I have and the right one is what I want.
ID Var1 ID Var1
101 aa 101 aa
101 ab 101 ab
101 ac 101 ac
101 ad 101 ad
102 ca 102 ca
103 da 103 da
101 aa
101 ab
101 ac
101 ad
The code I am using is
Proc sort data=have out=want nodupkey;
by ID Var1;
run;
However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@dapenDaniel wrote:
Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.
The left is what I have and the right one is what I want.
ID Var1 ID Var1
101 aa 101 aa
101 ab 101 ab
101 ac 101 ac
101 ad 101 ad
102 ca 102 ca
103 da 103 da
101 aa
101 ab
101 ac
101 ad
The code I am using is
Proc sort data=have out=want nodupkey; by ID Var1; run;
However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.
We can't tell if you used the wrong code because you have not shown what you actually got for a result. If you did not get the result you show it may mean that your data is not quite what you think. If a value has a leading space then ' aa' is not equal to 'aa' and would not be deleted. This is a potential issue with any of the variables on the BY statement. Another is if a value is numeric the assigned format might not be displaying decimal values that would show up as not equal.
Example of this last issue:
data example; input x; format x f1.; datalines; 1.5 1.6 1.7 ; run;
If you print or view the Example data set created above all of the values will appear as 2 because of the format but the actual values are different.
This is one reason we recommend posting data as an actual data step. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When I run
data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;
proc sort data=have out=want nodupkey;
by ID Var1;
run;
The log says
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Alternatively with first. in the data step.
data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;
proc sort data=have;
by ID Var1;
run;
data want;
set have;
by ID Var1;
if first.Var1;
run;
Jag
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;
proc sort data =have out= want noduprecs;
by id var1 ;
run;
/*or*/
proc sql;
create table want as
select distinct *
from have;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Be careful recommending NODUPRECS ... because it only deletes duplicate records that are adjacent to each other. For instance the following proc sort does NOT delete any records, even though records 1 and 3 are duplicates:
data have;
input id x y z;
datalines;
1 1 1 1
1 2 2 2
1 1 1 1
run;
proc sort data=have out=sorted_have noduprecs;
by id;
run;
I suggest keeping nodupkey, but use the variable proxy _ALL_ as the last sort key, as in
proc sort data=have out=sorted_have_all nodupkeys;
by id _ALL_;
run;
The technique here is to list all the desired primary sort keys first, and then append _ALL_ at the end of the by-list. It doesn't matter that ID is both the primary sort key, and also a subordinate sort key because it is part of _ALL_. The subordinate usage will have no impact on resulting sort order.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @mkeintz . I know and love it when you've helped and continue to help me learn numerous times. What I assumed was OP wanting select unique records and my understanding of noduprecs( exclude duplicate records and pick one record) rather than dup keys which is akin to select distinct * in proc sql, and hence i posted that lazy version too.
Nonetheless when and if you participate, i tend to pay sincere attention and kept thinking. So basically I think
proc sort
by _all_
with a noduprecs
should be equal to
proc sort
by all variables listed one after another with a nodupkey
Also OP's requirement seems to pick a combination of all vars as unique obs than based keys unless of course all keys making it a composite combination.
data have;
input id x y z;
datalines;
1 1 1 1
1 2 2 2
1 1 1 1
;
run;
proc sort noduprecs;
by _all_;
run;
I wonder if my understanding is correct, distinct sql is perhaps the easiest/laziest and convenient way to go 🙂 Btw, my apologies for the late acknowledgement as I slept like a baby all weekend and barely moved. I hope you have a nice day and week ahead.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes. To my knowledge the code
proc sort data=have out=want noduprecs;
by _all_;
run;
produces the same results as
proc sql ;
create table want as
select distinct *
from have;
quit;
And of course in the case of BY _ALL_, keywords NODUPKEY and NODUPRECS are the same.
The problem with BY _ALL_, as with SELECT *, is that, while duplicates are removed, you may not end up with the desired record order, for instance when ID is not the first variable in the data set. That's when you might choose to precede _ALL_ with the variables you want controlling the output order.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@dapenDaniel wrote:
Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.
The left is what I have and the right one is what I want.
ID Var1 ID Var1
101 aa 101 aa
101 ab 101 ab
101 ac 101 ac
101 ad 101 ad
102 ca 102 ca
103 da 103 da
101 aa
101 ab
101 ac
101 ad
The code I am using is
Proc sort data=have out=want nodupkey; by ID Var1; run;
However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.
We can't tell if you used the wrong code because you have not shown what you actually got for a result. If you did not get the result you show it may mean that your data is not quite what you think. If a value has a leading space then ' aa' is not equal to 'aa' and would not be deleted. This is a potential issue with any of the variables on the BY statement. Another is if a value is numeric the assigned format might not be displaying decimal values that would show up as not equal.
Example of this last issue:
data example; input x; format x f1.; datalines; 1.5 1.6 1.7 ; run;
If you print or view the Example data set created above all of the values will appear as 2 because of the format but the actual values are different.
This is one reason we recommend posting data as an actual data step. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you. There is indeed some leading spaces. I have solved this problem.