BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Weezy_bks
Calcite | Level 5

I've got a table named test_1 :

 

Spoiler
IDDATE_1DATE_2nb_1car_1car_2LIB_car_1DATE_3DATE_4sld
1003376290001Jan196001Jan19600SNAAMEPContrat validé16Sep202116Sep202142 400,00
1003376290001Jan196001Jan19600SNAAMEPContrat validé16Sep202116Sep202142 400,00
1003384447001Oct202130Sep202436SNAAAMLContrat mis en loyer24Dec202120Dec202142 400,00
1003384447001Oct202130Sep202436SNAAAMLContrat mis en loyer24Dec202120Dec202142 400,00
1003384450014Sep202113Sep202436SNAAAMLContrat mis en loyer20Oct202113Oct202142 400,00
1003384450014Sep202113Sep202436SNAAAMLContrat mis en loyer20Oct202113Oct202142 400,00

I'm running a simple proc sort with NODUPRECS option :

 

proc sort data = test_1 NODUPRECS out = data_in dupout = data_out; 
by _ALL_;
run;

The dupout result is :

 

Spoiler
IDDATE_1DATE_2nb_1car_1car_2LIB_car_1DATE_3DATE_4sld
an empty table.
 
Shouldn't there be three records in the output file?

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Avoid what?

Your actual data does not have duplicates.

You can check which variables are the issue by experimenting with using the DROP= dataset option to eliminate some of the variables from the input.

So for example if did 

proc sort data=have(drop=num1) nodupkey out=nodups;
  by _all_;
run;

And now some duplicates are dropped then it was NUM1 that was the problem.

 

If the issue is with a numeric variable (date variables are numeric) and the variables have fractional values then perhaps the differences are below the level of precision you can see in the printout.  So use the ROUND() function.  For example if you only care about the values to two decimal places you could do:

data clean;
  set have;
  num1 = round(num1,0.01);
run;

And then try your sort nodupkey on the new cleaned up dataset.

 

If the variable that is causing non-duplicates is character then perhaps you have non-visible character differences.  Such as tabs, linefeeds, non-breaking spaces.  Or perhaps the format width is shorter than the variable length so the full values are not printing.

View solution in original post

11 REPLIES 11
Reeza
Super User
Show the log as well.
Weezy_bks
Calcite | Level 5
30         options obs=max;
31         options compress=yes;
32         
33         
34         proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35         by _ALL_;
36         run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent. 
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
ballardw
Super User

What does the LOG show when you run the code.

 

Note that NODUPRECS is no longer documented for Proc Sort. Plus this note from the documentation:

The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.

 

So try NODUPKEY and see if that gets what you expect.

Tom
Super User Tom
Super User

@ballardw wrote:

What does the LOG show when you run the code.

 

Note that NODUPRECS is no longer documented for Proc Sort. Plus this note from the documentation:

The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.

 

So try NODUPKEY and see if that gets what you expect.


Good point.  NODUPRECS does not do what most people expect.

 

BUT in this case since the code is using BY _ALL_ it means that NODUPRECS and NODUPKEY will do the same thing.

Weezy_bks
Calcite | Level 5
log with NODUPRECS option :

30 options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

by replacing NODUPRECS by NODUPKEY :

options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 nodupkey out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
Kurt_Bremser
Super User

@Weezy_bks wrote:
log with NODUPRECS option :

30 options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 NODUPRECS out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

by replacing NODUPRECS by NODUPKEY :

options obs=max;
31 options compress=yes;
32
33
34 proc sort data = test_1 nodupkey out = data_in dupout = data_out;
35 by _ALL_;
36 run;

NOTE: There were 6 observations read from the data set WORK.TEST_1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 21 variables.
NOTE: Compressing data set WORK.DATA_IN increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: The data set WORK.DATA_OUT has 0 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

The case is very clear: there are NO duplicates when all 21 variables are considered.

Weezy_bks
Calcite | Level 5
I have the same result with NODUPKEY
Tom
Super User Tom
Super User

Your posted data has duplicates:

557  proc sort data = have nodupkey out = data_in dupout = data_out;
558    by _ALL_;
559  run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 3 observations and 10 variables.
NOTE: The data set WORK.DATA_OUT has 3 observations and 10 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Most likely your real data is different in some slight way that is not visible in the printout you posted.

 

For example try this dataset where a fractional part of a day is added to one of the date variables.

data have;
  infile cards dsd dlm='|' truncover;
  input ID :$15. (DATE_1 DATE_2) (:date.) nb_1 car_1 $ car_2 $ LIB_car_1 :$30. (DATE_3 DATE_4) (:date.)
     sld :commax. 
  ;
  format date_1-date_4 date9.;
  date_1 + (_n_)/100 ;
cards;
10033762900|01Jan1960|01Jan1960|0|SNA|AMEP|Contrat validé|16Sep2021|16Sep2021|42 400,00
10033844470|01Oct2021|30Sep2024|36|SNA|AAML|Contrat mis en loyer|24Dec2021|20Dec2021|42 400,00
10033762900|01Jan1960|01Jan1960|0|SNA|AMEP|Contrat validé|16Sep2021|16Sep2021|42 400,00
10033844500|14Sep2021|13Sep2024|36|SNA|AAML|Contrat mis en loyer|20Oct2021|13Oct2021|42 400,00
10033844470|01Oct2021|30Sep2024|36|SNA|AAML|Contrat mis en loyer|24Dec2021|20Dec2021|42 400,00
10033844500|14Sep2021|13Sep2024|36|SNA|AAML|Contrat mis en loyer|20Oct2021|13Oct2021|42 400,00
;

Results:

595  proc sort data = have nodupkey out = data_in dupout = data_out;
596    by _ALL_;
597  run;

NOTE: There were 6 observations read from the data set WORK.HAVE.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.DATA_IN has 6 observations and 10 variables.
NOTE: The data set WORK.DATA_OUT has 0 observations and 10 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
Weezy_bks
Calcite | Level 5
Do you have a solution to avoid this?

thanks
FreelanceReinh
Jade | Level 19

Thanks, @Weezy_bks, for showing the logs. Now let us look at the data "with SAS's eyes" so as to understand why SAS doesn't find any duplicates (and to make the slight differences visible that Tom mentioned). Please show us the output of this PROC PRINT step:

proc print data=test_1 width=min;
format _numeric_ hex16. _char_ $hex40.;
run;
Tom
Super User Tom
Super User

Avoid what?

Your actual data does not have duplicates.

You can check which variables are the issue by experimenting with using the DROP= dataset option to eliminate some of the variables from the input.

So for example if did 

proc sort data=have(drop=num1) nodupkey out=nodups;
  by _all_;
run;

And now some duplicates are dropped then it was NUM1 that was the problem.

 

If the issue is with a numeric variable (date variables are numeric) and the variables have fractional values then perhaps the differences are below the level of precision you can see in the printout.  So use the ROUND() function.  For example if you only care about the values to two decimal places you could do:

data clean;
  set have;
  num1 = round(num1,0.01);
run;

And then try your sort nodupkey on the new cleaned up dataset.

 

If the variable that is causing non-duplicates is character then perhaps you have non-visible character differences.  Such as tabs, linefeeds, non-breaking spaces.  Or perhaps the format width is shorter than the variable length so the full values are not printing.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1423 views
  • 5 likes
  • 6 in conversation