Re: proc sort not deduping by all

toomanystepsint · Posted 02-28-2023 05:02 PM

proc sort nodupkey data=data;by _ALL_; run;

does not work. the log just says the data is already sorted. it does not dedupe even though I'm seeing duplicate rows. why does this not work?

Tom · Posted 02-28-2023 07:33 PM

Actual compare two of the observations that LOOK the same to see what differences there are.

So assuming the two observations that look like a duplicates are number 5 and number 6 you could run this code to compare them.

data one two;
  set data firstobs=5 obs=6;
  if _n_=1 then output one;
  else output two;
run;
proc compare data=one compare=two;
run;

Patrick · Posted 03-01-2023 12:20 AM

Based on my testing SAS is actually smart enough to determine that the source data has already been sorted with option nodupkey.

Have a look at below test case.

data test;
  do i=3,8,4,3,3;
    var='x';
    output;
  end;
run;

proc sort data=test;
  by _all_;
run;

proc sort data=test nodupkey;
  by _all_;
run;

proc sort data=test nodupkey;
  by _all_;
run;

28         data test;
29           do i=3,8,4,3,3;
30             var='x';
31             output;
32           end;
33         run;

NOTE: The data set WORK.TEST has 5 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

34         
35         proc sort data=test;
36           by _all_;
37         run;

NOTE: There were 5 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST has 5 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

38         
39         proc sort data=test nodupkey;
40           by _all_;
41         run;

NOTE: There were 5 observations read from the data set WORK.TEST.
NOTE: 2 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 3 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

42         
43         proc sort data=test nodupkey;
44           by _all_;
45         run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Proc Sort is already around since so long that I would have been very astonished if there still would have been such a fundamental bug in it.

If you run a Proc Contents against the table sorted with nodupkey then you'll see that SAS stores in the table metadata that the table is sorted, by which variables in which order and with which options.

Patrick · Posted 03-01-2023 02:37 AM

@toomanystepsint wrote:

proc sort nodupkey data=data;by _ALL_; run;

does not work. the log just says the data is already sorted. it does not dedupe even though I'm seeing duplicate rows. why does this not work?

Just adding to my previous post: I bet that you don't have real duplicate rows after running proc sort as posted above.

One reason that can lead to rows that look like duplicates are non printable characters.

If you want to "force" a resort then you could just create a new table and sort this one.

proc sort nodupkey data=data;by _ALL_; run;

data take2;
  set data;
run;

proc sort nodupkey data=take2;by _ALL_; run;

By creating a new dataset the metadata sort information will get lost and though Proc Sort will resort the new table.
You will get the exactly same result but sometimes seeing is believing.

ballardw · Posted 03-01-2023 10:40 AM

The formats assigned to variables can have different values, used by Proc Sort, that appear identical to users.

A simple example:

data example;
   input x;
   format x 5.1;
datalines;
1.1
1.11
1.111
1.1111
;

proc sort data=example out=sorted nodupkey;
   by x;
run;

The default format assigned to x only displays 1 decimal (rounding as needed). ALL the values with that format appear the same but none are. So the sort does not remove any of them.

Other types of formats can have different appearance but similar result.

proc sort not deduping by all