I've got a table named test_1:
ID | DATE_1 | DATE_2 | nb_1 | car_1 | car_2 | LIB_car_1 | DATE_3 | DATE_4 | sld |
10033762900 | 01Jan1960 | 01Jan1960 | 0 | SNA | AMEP | Contrat validé | 16Sep2021 | 16Sep2021 | 42 400,00 |
10033762900 | 01Jan1960 | 01Jan1960 | 0 | SNA | AMEP | Contrat validé | 16Sep2021 | 16Sep2021 | 42 400,00 |
10033844470 | 01Oct2021 | 30Sep2024 | 36 | SNA | AAML | Contrat mis en loyer | 24Dec2021 | 20Dec2021 | 42 400,00 |
10033844470 | 01Oct2021 | 30Sep2024 | 36 | SNA | AAML | Contrat mis en loyer | 24Dec2021 | 20Dec2021 | 42 400,00 |
10033844500 | 14Sep2021 | 13Sep2024 | 36 | SNA | AAML | Contrat mis en loyer | 20Oct2021 | 13Oct2021 | 42 400,00 |
10033844500 | 14Sep2021 | 13Sep2024 | 36 | SNA | AAML | Contrat mis en loyer | 20Oct2021 | 13Oct2021 | 42 400,00 |
I'm running a simple PROC SORT with the NODUPRECS option:
proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
by _ALL_;
run;
The dupout result is empty (only the header row):
ID | DATE_1 | DATE_2 | nb_1 | car_1 | car_2 | LIB_car_1 | DATE_3 | DATE_4 | sld |
Accepted Solutions
Avoid what?
Your actual data does not have duplicates.
You can check which variables are the issue by experimenting with the DROP= dataset option to eliminate some of the variables from the input. So, for example, if you ran
proc sort data=have(drop=num1) nodupkey out=nodups;
by _all_;
run;
and some duplicates are now dropped, then NUM1 was the variable causing the problem.
If the issue is with a numeric variable (date variables are numeric) and the values have fractional parts, then perhaps the differences are below the level of precision you can see in the printout. In that case, use the ROUND() function. For example, if you only care about the values to two decimal places, you could do:
data clean;
set have;
num1 = round(num1,0.01);
run;
Then try your PROC SORT with NODUPKEY on the cleaned-up dataset.
If the variable causing the non-duplicates is character, then perhaps you have invisible character differences, such as tabs, line feeds, or non-breaking spaces. Or perhaps the format width is shorter than the variable length, so the full values are not printed.
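As a sketch of that character cleanup (LIB_car_1 is used here only as an example name from the posted table; adjust to whichever variable turns out to differ), COMPRESS with the 'k' (keep) and 'w' (printable) modifiers drops non-printable characters such as tabs and line feeds:

```sas
/* Sketch: remove non-printable characters from a character variable. */
/* 'k' = keep only the listed classes, 'w' = printable characters,    */
/* so everything non-printable (tabs, line feeds, ...) is dropped.    */
data clean_char;
    set have;
    lib_car_1 = compress(lib_car_1, , 'kw');
run;
```

Note this does not normalize case or leading blanks; STRIP() or UPCASE() may also be needed depending on what the hex printout reveals.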
What does the LOG show when you run the code?
Note that NODUPRECS is no longer documented for Proc Sort. Plus this note from the documentation:
The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.
So try NODUPKEY and see if that gets what you expect.
@ballardw wrote:
What does the LOG show when you run the code.
Note that NODUPRECS is no longer documented for Proc Sort. Plus this note from the documentation:
The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.
So try NODUPKEY and see if that gets what you expect.
Good point. NODUPRECS does not do what most people expect.
But in this case, since the code uses BY _ALL_, NODUPRECS and NODUPKEY will do the same thing.
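The difference matters as soon as the BY statement lists fewer variables. A small illustration with hypothetical data (not the poster's table):

```sas
/* Hypothetical data: same ID, mixed SLD values */
data demo;
    input id $ sld;
cards;
A 100
A 200
A 100
;

/* NODUPKEY compares only the BY variables: keeps the first row per   */
/* BY group, so only one row for ID A survives.                       */
proc sort data=demo nodupkey out=by_key;
    by id;
run;

/* NODUPRECS compares each record with the PREVIOUS one after the     */
/* sort. Because the sort is by ID only, the two identical A/100 rows */
/* may not end up adjacent, and then both survive -- the classic      */
/* NODUPRECS surprise.                                                */
proc sort data=demo noduprecs out=by_rec;
    by id;
run;
```

With BY _ALL_, identical records are always sorted next to each other, which is why the two options coincide here.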
30 options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35 by _ALL_;
36 run;
NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
Replacing NODUPRECS with NODUPKEY:
30 options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 nodupkey out = data_in dupout = data_out;
35 by _ALL_;
36 run;
NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
@Weezy_bks wrote:
(logs quoted above: 0 duplicates deleted with both NODUPRECS and NODUPKEY)
The case is very clear: there are NO duplicates when all 21 variables are considered.
Your posted data has duplicates:
557 proc sort data = have nodupkey out = data_in dupout = data_out;
558 by _ALL_;
559 run;
NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 3 observations and 10 variables.
NOTE: The data set WORK.DATA_OUT has 3 observations and 10 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
Most likely your real data is different in some slight way that is not visible in the printout you posted.
For example try this dataset where a fractional part of a day is added to one of the date variables.
data have;
infile cards dsd dlm='|' truncover;
input ID :$15. (DATE_1 DATE_2) (:date.) nb_1 car_1 $ car_2 $ LIB_car_1 :$30. (DATE_3 DATE_4) (:date.)
sld :commax.
;
format date_1-date_4 date9.;
date_1 + (_n_)/100 ;
cards;
10033762900|01Jan1960|01Jan1960|0|SNA|AMEP|Contrat validé|16Sep2021|16Sep2021|42 400,00
10033844470|01Oct2021|30Sep2024|36|SNA|AAML|Contrat mis en loyer|24Dec2021|20Dec2021|42 400,00
10033762900|01Jan1960|01Jan1960|0|SNA|AMEP|Contrat validé|16Sep2021|16Sep2021|42 400,00
10033844500|14Sep2021|13Sep2024|36|SNA|AAML|Contrat mis en loyer|20Oct2021|13Oct2021|42 400,00
10033844470|01Oct2021|30Sep2024|36|SNA|AAML|Contrat mis en loyer|24Dec2021|20Dec2021|42 400,00
10033844500|14Sep2021|13Sep2024|36|SNA|AAML|Contrat mis en loyer|20Oct2021|13Oct2021|42 400,00
;
Results:
595 proc sort data = have nodupkey out = data_in dupout = data_out;
596 by _ALL_;
597 run;
NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 10 variables.
NOTE: The data set WORK.DATA_OUT has 0 observations and 10 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
thanks
Thanks, @Weezy_bks, for showing the logs. Now let us look at the data "with SAS's eyes" to understand why SAS doesn't find any duplicates (and to make visible the slight differences Tom mentioned). Please show us the output of this PROC PRINT step:
proc print data=test_1 width=min;
format _numeric_ hex16. _char_ $hex40.;
run;
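To see why the HEX16. format helps (a sketch with made-up values, not the poster's data): two date values that print identically with DATE9. still differ in their underlying bits when one carries a fractional part:

```sas
/* Sketch: a hidden fractional part makes two "identical" dates unequal. */
data _null_;
    d1 = '16SEP2021'd;
    d2 = d1 + 0.01;                 /* invisible fraction of a day */
    put d1 date9. ' and ' d2 date9. ' print the same, but: ' @;
    if d1 = d2 then put 'equal';
    else put 'NOT equal';
    /* HEX16. exposes the raw 8-byte floating-point representation */
    put d1 hex16. ' vs ' d2 hex16.;
run;
```

The DATE9. format truncates the fraction for display, which is exactly how such values hide in an ordinary printout.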