BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dapenDaniel
Obsidian | Level 7

Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.

The left is what I have and the right one is what I want.

 

ID               Var1                              ID             Var1

 

101             aa                                101              aa

101             ab                                101              ab

101             ac                                101              ac

101             ad                                101              ad

102             ca                                102              ca

103             da                                103              da

101             aa

101             ab

101             ac

101             ad

 

The code I am using is 

Proc sort data=have out=want nodupkey;
by ID Var1;
run;

However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@dapenDaniel wrote:

Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.

The left is what I have and the right one is what I want.

 

ID               Var1                              ID             Var1

 

101             aa                                101              aa

101             ab                                101              ab

101             ac                                101              ac

101             ad                                101              ad

102             ca                                102              ca

103             da                                103              da

101             aa

101             ab

101             ac

101             ad

 

The code I am using is 

Proc sort data=have out=want nodupkey;
by ID Var1;
run;

However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.


We can't tell if you used the wrong code because you have not shown what you actually got for a result. If you did not get the result you show it may mean that your data is not quite what you think. If a  value has a leading space then ' aa' is not equal to 'aa' and would not be deleted. This is a potential issue with any of the variables on the BY statement. Another is if a value is numeric the assigned format might not be displaying decimal values that would show up as not equal.

Example of this last issue:

data example;
   input x;
   format x f1.;
datalines;
1.5
1.6
1.7
;
run;

If you print or view the Example data set created above all of the values will appear as 2 because of the format but the actual values are different.

 

This is one reason we recommend posting data as an actual data step.  Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

View solution in original post

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

When I run

 

data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;

proc sort data=have out=want nodupkey;
   by ID Var1;
run;

 

The log says

 

Capture.PNG

 

 

Jagadishkatam
Amethyst | Level 16

Alternatively with first. in the data step.

 

data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;

proc sort data=have;
by ID Var1;
run;

data want;
set have;
by ID Var1;
if first.Var1;
run;
Thanks,
Jag
novinosrin
Tourmaline | Level 20
data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;



proc sort data =have out= want noduprecs;
by id var1 ;
run;

/*or*/

proc sql;
create table want as
select distinct *
from have;
quit;
mkeintz
PROC Star

@novinosrin

 

Be careful recommending NODUPRECS ... because it only deletes duplicate records that are adjacent to each other.  For instance the following proc sort does NOT delete any records, even though records 1 and 3 are duplicates:

 

data have;
  input id x y z;
datalines;
1 1 1 1
1 2 2 2
1 1 1 1
run;
proc sort data=have out=sorted_have noduprecs;
  by id;
run;

 

I suggest keeping nodupkey, but use the variable proxy _ALL_ as the last sort key, as in

proc sort data=have out=sorted_have_all nodupkeys;
  by id _ALL_;
run;

 

The technique here is to list all the desired primary sort keys first, and then append _ALL_ at the end of the by-list.  It doesn't matter that ID is both the primary sort key, and also a subordinate sort key because it is part of _ALL_.  The subordinate usage will have no impact on resulting sort order.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20

Thank you @mkeintz . I know and love it when you've helped and continue to help me learn numerous times.  What I assumed was OP wanting select unique records and my understanding of noduprecs( exclude duplicate records and pick one record) rather than dup keys which is akin to select distinct * in proc sql, and hence i posted that lazy version too.

 

Nonetheless when and if you participate, i tend to pay sincere attention and kept thinking. So basically I think

proc sort 

by _all_ 

with a noduprecs 

should be equal to

proc sort 

by all variables listed one after another with a nodupkey

 

Also OP's requirement seems to pick a combination of all vars as unique obs than based keys unless of course all keys making it a composite combination. 

 


data have;
  input id x y z;
datalines;
1 1 1 1
1 2 2 2
1 1 1 1
;
run;

proc sort noduprecs;
by _all_;
run;

I wonder if my understanding is correct, distinct sql is perhaps the easiest/laziest and convenient way to go 🙂 Btw, my apologies for the late acknowledgement as I slept like a baby all weekend and barely moved. I hope you have a nice day and week ahead. 

 

 

 

mkeintz
PROC Star

@novinosrin:

 

Yes.  To my knowledge the code

 

proc sort data=have out=want noduprecs;
  by _all_;
run;

produces the same results as

 

proc sql ;
  create table want as
  select distinct *
  from have;
quit;

 

And of course in the case of BY _ALL_, keywords NODUPKEY and NODUPRECS are the same.

 

The problem with BY _ALL_, as with SELECT *, is that, while duplicates are removed, you may not end up with the desired record order, for instance when ID is not the first variable in the data set.  That's when you might choose to precede _ALL_ with the variables you want controlling the output order.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

@dapenDaniel wrote:

Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.

The left is what I have and the right one is what I want.

 

ID               Var1                              ID             Var1

 

101             aa                                101              aa

101             ab                                101              ab

101             ac                                101              ac

101             ad                                101              ad

102             ca                                102              ca

103             da                                103              da

101             aa

101             ab

101             ac

101             ad

 

The code I am using is 

Proc sort data=have out=want nodupkey;
by ID Var1;
run;

However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.


We can't tell if you used the wrong code because you have not shown what you actually got for a result. If you did not get the result you show it may mean that your data is not quite what you think. If a  value has a leading space then ' aa' is not equal to 'aa' and would not be deleted. This is a potential issue with any of the variables on the BY statement. Another is if a value is numeric the assigned format might not be displaying decimal values that would show up as not equal.

Example of this last issue:

data example;
   input x;
   format x f1.;
datalines;
1.5
1.6
1.7
;
run;

If you print or view the Example data set created above all of the values will appear as 2 because of the format but the actual values are different.

 

This is one reason we recommend posting data as an actual data step.  Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

dapenDaniel
Obsidian | Level 7

Thank you. There is indeed some leading spaces. I have solved this problem.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 2095 views
  • 1 like
  • 6 in conversation