BookmarkSubscribeRSS Feed
toomanystepsint
Fluorite | Level 6

proc sort nodupkey data=data;by _ALL_; run;

 

does not work. the log just says the data is already sorted. it does not dedupe even though I'm seeing duplicate rows. why does this not work? 

4 REPLIES 4
Tom
Super User Tom
Super User

Actual compare two of the observations that LOOK the same to see what differences there are.

So assuming the two observations that look like a duplicates are number 5 and number 6 you could run this code to compare them.

data one two;
  set data firstobs=5 obs=6;
  if _n_=1 then output one;
  else output two;
run;
proc compare data=one compare=two;
run;
Patrick
Opal | Level 21

Based on my testing SAS is actually smart enough to determine that the source data has already been sorted with option nodupkey.

Have a look at below test case.

data test;
  do i=3,8,4,3,3;
    var='x';
    output;
  end;
run;

proc sort data=test;
  by _all_;
run;

proc sort data=test nodupkey;
  by _all_;
run;

proc sort data=test nodupkey;
  by _all_;
run;
28         data test;
29           do i=3,8,4,3,3;
30             var='x';
31             output;
32           end;
33         run;

NOTE: The data set WORK.TEST has 5 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

34         
35         proc sort data=test;
36           by _all_;
37         run;

NOTE: There were 5 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST has 5 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

38         
39         proc sort data=test nodupkey;
40           by _all_;
41         run;

NOTE: There were 5 observations read from the data set WORK.TEST.
NOTE: 2 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 3 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

42         
43         proc sort data=test nodupkey;
44           by _all_;
45         run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Proc Sort is already around since so long that I would have been very astonished if there still would have been such a fundamental bug in it.

 

If you run a Proc Contents against the table sorted with nodupkey then you'll see that SAS stores in the table metadata that the table is sorted, by which variables in which order and with which options.

Patrick_0-1677648243727.png

 

 

Patrick
Opal | Level 21

@toomanystepsint wrote:

proc sort nodupkey data=data;by _ALL_; run;

 

does not work. the log just says the data is already sorted. it does not dedupe even though I'm seeing duplicate rows. why does this not work? 


Just adding to my previous post: I bet that you don't have real duplicate rows after running proc sort as posted above. 

One reason that can lead to rows that look like duplicates are non printable characters.

 

If you want to "force" a resort then you could just create a new table and sort this one. 

proc sort nodupkey data=data;by _ALL_; run;

data take2;
  set data;
run;

proc sort nodupkey data=take2;by _ALL_; run;

By creating a new dataset the metadata sort information will get lost and though Proc Sort will resort the new table.
You will get the exactly same result but sometimes seeing is believing.

ballardw
Super User

The formats assigned to variables can have different values, used by Proc Sort, that appear identical to users.

 

A simple example:

data example;
   input x;
   format x 5.1;
datalines;
1.1
1.11
1.111
1.1111
;

proc sort data=example out=sorted nodupkey;
   by x;
run;

The default format assigned to x only displays 1 decimal (rounding as needed). ALL the values with that format appear the same but none are. So the sort does not remove any of them.

Other types of formats can have different appearance but similar result.

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 469 views
  • 3 likes
  • 4 in conversation