proc sort nodupkey data=data;by _ALL_; run;
does not work. the log just says the data is already sorted. it does not dedupe even though I'm seeing duplicate rows. why does this not work?
Actual compare two of the observations that LOOK the same to see what differences there are.
So assuming the two observations that look like a duplicates are number 5 and number 6 you could run this code to compare them.
data one two;
set data firstobs=5 obs=6;
if _n_=1 then output one;
else output two;
run;
proc compare data=one compare=two;
run;
Based on my testing SAS is actually smart enough to determine that the source data has already been sorted with option nodupkey.
Have a look at below test case.
data test;
do i=3,8,4,3,3;
var='x';
output;
end;
run;
proc sort data=test;
by _all_;
run;
proc sort data=test nodupkey;
by _all_;
run;
proc sort data=test nodupkey;
by _all_;
run;
28 data test; 29 do i=3,8,4,3,3; 30 var='x'; 31 output; 32 end; 33 run; NOTE: The data set WORK.TEST has 5 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 34 35 proc sort data=test; 36 by _all_; 37 run; NOTE: There were 5 observations read from the data set WORK.TEST. NOTE: The data set WORK.TEST has 5 observations and 2 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 38 39 proc sort data=test nodupkey; 40 by _all_; 41 run; NOTE: There were 5 observations read from the data set WORK.TEST. NOTE: 2 observations with duplicate key values were deleted. NOTE: The data set WORK.TEST has 3 observations and 2 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 42 43 proc sort data=test nodupkey; 44 by _all_; 45 run; NOTE: Input data set is already sorted, no sorting done. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Proc Sort is already around since so long that I would have been very astonished if there still would have been such a fundamental bug in it.
If you run a Proc Contents against the table sorted with nodupkey then you'll see that SAS stores in the table metadata that the table is sorted, by which variables in which order and with which options.
@toomanystepsint wrote:
proc sort nodupkey data=data;by _ALL_; run;
does not work. the log just says the data is already sorted. it does not dedupe even though I'm seeing duplicate rows. why does this not work?
Just adding to my previous post: I bet that you don't have real duplicate rows after running proc sort as posted above.
One reason that can lead to rows that look like duplicates are non printable characters.
If you want to "force" a resort then you could just create a new table and sort this one.
proc sort nodupkey data=data;by _ALL_; run;
data take2;
set data;
run;
proc sort nodupkey data=take2;by _ALL_; run;
By creating a new dataset the metadata sort information will get lost and though Proc Sort will resort the new table.
You will get the exactly same result but sometimes seeing is believing.
The formats assigned to variables can have different values, used by Proc Sort, that appear identical to users.
A simple example:
data example; input x; format x 5.1; datalines; 1.1 1.11 1.111 1.1111 ; proc sort data=example out=sorted nodupkey; by x; run;
The default format assigned to x only displays 1 decimal (rounding as needed). ALL the values with that format appear the same but none are. So the sort does not remove any of them.
Other types of formats can have different appearance but similar result.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.