SAS Enterprise Guide

Weezy_bks · Posted 05-25-2022 11:50 AM

I've got a table named test_1 :

ID	DATE_1	DATE_2	nb_1	car_1	car_2	LIB_car_1	DATE_3	DATE_4	sld
10033762900	01Jan1960	01Jan1960	0	SNA	AMEP	Contrat validé	16Sep2021	16Sep2021	42 400,00
10033762900	01Jan1960	01Jan1960	0	SNA	AMEP	Contrat validé	16Sep2021	16Sep2021	42 400,00
10033844470	01Oct2021	30Sep2024	36	SNA	AAML	Contrat mis en loyer	24Dec2021	20Dec2021	42 400,00
10033844470	01Oct2021	30Sep2024	36	SNA	AAML	Contrat mis en loyer	24Dec2021	20Dec2021	42 400,00
10033844500	14Sep2021	13Sep2024	36	SNA	AAML	Contrat mis en loyer	20Oct2021	13Oct2021	42 400,00
10033844500	14Sep2021	13Sep2024	36	SNA	AAML	Contrat mis en loyer	20Oct2021	13Oct2021	42 400,00

IDDATE_1DATE_2nb_1car_1car_2LIB_car_1DATE_3DATE_4sld1003376290001Jan196001Jan19600SNAAMEPContrat validé16Sep202116Sep202142 400,001003376290001Jan196001Jan19600SNAAMEPContrat validé16Sep202116Sep202142 400,001003384447001Oct202130Sep202436SNAAAMLContrat mis en loyer24Dec202120Dec202142 400,001003384447001Oct202130Sep202436SNAAAMLContrat mis en loyer24Dec202120Dec202142 400,001003384450014Sep202113Sep202436SNAAAMLContrat mis en loyer20Oct202113Oct202142 400,001003384450014Sep202113Sep202436SNAAAMLContrat mis en loyer20Oct202113Oct202142 400,00

I'm running a simple proc sort with NODUPRECS option :

proc sort data = test_1 NODUPRECS out = data_in dupout = data_out; 
by _ALL_;
run;

The dupout result is :

Spoiler

ID

DATE_1

DATE_2

nb_1

car_1

car_2

LIB_car_1

DATE_3

DATE_4

sld

an empty table.

Shouldn't there be three records in the output file?

Tom · Posted 05-27-2022 09:41 AM

Avoid what?

Your actual data does not have duplicates.

You can check which variables are the issue by experimenting with using the DROP= dataset option to eliminate some of the variables from the input.

So for example if did

proc sort data=have(drop=num1) nodupkey out=nodups;
  by _all_;
run;

And now some duplicates are dropped then it was NUM1 that was the problem.

If the issue is with a numeric variable (date variables are numeric) and the variables have fractional values then perhaps the differences are below the level of precision you can see in the printout. So use the ROUND() function. For example if you only care about the values to two decimal places you could do:

data clean;
  set have;
  num1 = round(num1,0.01);
run;

And then try your sort nodupkey on the new cleaned up dataset.

If the variable that is causing non-duplicates is character then perhaps you have non-visible character differences. Such as tabs, linefeeds, non-breaking spaces. Or perhaps the format width is shorter than the variable length so the full values are not printing.

View solution in original post

Reeza · Posted 05-25-2022 11:58 AM

Show the log as well.

Weezy_bks · Posted 05-27-2022 08:49 AM

30         options obs=max;
31         options compress=yes;
32         
33         
34         proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35         by _ALL_;
36         run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent. 
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

ballardw · Posted 05-25-2022 12:03 PM

What does the LOG show when you run the code.

Note that NODUPRECS is no longer documented for Proc Sort. Plus this note from the documentation:

The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.

So try NODUPKEY and see if that gets what you expect.

Tom · Posted 05-25-2022 12:49 PM

@ballardw wrote:

What does the LOG show when you run the code.

Note that NODUPRECS is no longer documented for Proc Sort. Plus this note from the documentation:

The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.

So try NODUPKEY and see if that gets what you expect.

Good point. NODUPRECS does not do what most people expect.

BUT in this case since the code is using BY _ALL_ it means that NODUPRECS and NODUPKEY will do the same thing.

Weezy_bks · Posted 05-27-2022 08:53 AM

log with NODUPRECS option :

30 options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

by replacing NODUPRECS by NODUPKEY :

options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 nodupkey out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

Kurt_Bremser · Posted 05-27-2022 03:21 PM

@Weezy_bks wrote:
log with NODUPRECS option :

30 options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

by replacing NODUPRECS by NODUPKEY :

options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 nodupkey out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

The case is very clear: there are NO duplicates when all 21 variables are considered.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Weezy_bks · Posted 05-27-2022 09:02 AM

I have the same result with NODUPKEY

Tom · Posted 05-25-2022 12:10 PM

Your posted data has duplicates:

557  proc sort data = have nodupkey out = data_in dupout = data_out;
558    by _ALL_;
559  run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 3 observations and 10 variables.
NOTE: The data set WORK.DATA_OUT has 3 observations and 10 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Most likely your real data is different in some slight way that is not visible in the printout you posted.

For example try this dataset where a fractional part of a day is added to one of the date variables.

data have;
  infile cards dsd dlm='|' truncover;
  input ID :$15. (DATE_1 DATE_2) (:date.) nb_1 car_1 $ car_2 $ LIB_car_1 :$30. (DATE_3 DATE_4) (:date.)
     sld :commax. 
  ;
  format date_1-date_4 date9.;
  date_1 + (_n_)/100 ;
cards;
10033762900|01Jan1960|01Jan1960|0|SNA|AMEP|Contrat validé|16Sep2021|16Sep2021|42 400,00
10033844470|01Oct2021|30Sep2024|36|SNA|AAML|Contrat mis en loyer|24Dec2021|20Dec2021|42 400,00
10033762900|01Jan1960|01Jan1960|0|SNA|AMEP|Contrat validé|16Sep2021|16Sep2021|42 400,00
10033844500|14Sep2021|13Sep2024|36|SNA|AAML|Contrat mis en loyer|20Oct2021|13Oct2021|42 400,00
10033844470|01Oct2021|30Sep2024|36|SNA|AAML|Contrat mis en loyer|24Dec2021|20Dec2021|42 400,00
10033844500|14Sep2021|13Sep2024|36|SNA|AAML|Contrat mis en loyer|20Oct2021|13Oct2021|42 400,00
;

Results:

595  proc sort data = have nodupkey out = data_in dupout = data_out;
596    by _ALL_;
597  run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 10 variables.
NOTE: The data set WORK.DATA_OUT has 0 observations and 10 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Weezy_bks · Posted 05-27-2022 08:58 AM

Do you have a solution to avoid this?

thanks

FreelanceReinh · Posted 05-27-2022 09:10 AM

Thanks, @Weezy_bks, for showing the logs. Now let us look at the data "with SAS's eyes" so as to understand why SAS doesn't find any duplicates (and to make the slight differences visible that Tom mentioned). Please show us the output of this PROC PRINT step:

proc print data=test_1 width=min;
format _numeric_ hex16. _char_ $hex40.;
run;

Tom · Posted 05-27-2022 09:41 AM

Avoid what?

Your actual data does not have duplicates.

You can check which variables are the issue by experimenting with using the DROP= dataset option to eliminate some of the variables from the input.

So for example if did

proc sort data=have(drop=num1) nodupkey out=nodups;
  by _all_;
run;

And now some duplicates are dropped then it was NUM1 that was the problem.

If the issue is with a numeric variable (date variables are numeric) and the variables have fractional values then perhaps the differences are below the level of precision you can see in the printout. So use the ROUND() function. For example if you only care about the values to two decimal places you could do:

data clean;
  set have;
  num1 = round(num1,0.01);
run;

And then try your sort nodupkey on the new cleaned up dataset.

If the variable that is causing non-duplicates is character then perhaps you have non-visible character differences. Such as tabs, linefeeds, non-breaking spaces. Or perhaps the format width is shorter than the variable length so the full values are not printing.

SAS Enterprise Guide

Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

Re: Proc sort with NODUPRECS option not working

custom sort

Can you sort proc mi missing data patterns table?

SQL: Problem getting Barplot to sort?

Working with Risk Pipelines in SAS Risk Engine

PROC REPORT with default order option creates unwanted order

Follow Us

What is...

SAS Enterprise Guide

Join us for our biggest event of the year!

SAS Training: Just a Click Away

Follow Us

What is...