Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- Can you please explain this log?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 12-05-2014 01:12 AM
(1178 views)

Hi guys,

First of all let me clarify that I am executing my code in SAS version 8.2

Have a look at the 2 scenarios below. I am basically trying to eliminate exact duplicate records from dataset "abc", keeping/dropping certain variables and also applying a filter. I understand why the first first proc sort on data abc does not delete duplicates. It is definitely because the "by" variables used do not ensure that exact duplicate records get arranged in a sequential order and hence do not get deleted.

What I do not understand is the output of the second proc sort on data "def". In both the scenarios this second proc sort is exactly the same. Why then does it not eliminate duplicates in scenario 1 whereas delete in scenario 2?

I am pretty sure I am missing something really basic here

**Scenario 1:**

proc sort data = abc (keep = var1 var2 var3 var4 var5 var6 var7 var8)

out = def (drop = var5) **noduprec**;

by var1 var2 var3 var4;

where upcase(compbl(var5)) = "SOME TEXT" and var6 = 9999;

run;

**NOTE: 0 duplicate observations were deleted.**

NOTE: There were 43 observations read from the data set ABC.

WHERE (UPCASE(COMPBL(var5))='SOME TEXT') and (var6=9999);

NOTE: The data set WORK.DEF has 43 observations and 7 variables.

NOTE: PROCEDURE SORT used:

real time 0.08 seconds

cpu time 0.08 seconds

proc sort data = def out= ghi **noduprec**;

by var1 var2 var3;

run;

**NOTE: Input data set is already sorted; it has been copied to the output data set.**

NOTE: There were 43 observations read from the data set WORK.DEF

NOTE: The data set WORK.GHI has 43 observations and 7 variables.

NOTE: PROCEDURE SORT used:

real time 0.00 seconds

cpu time 0.00 seconds

**Scenario 2:**

proc sort data = abc (keep = var1 var2 var3 var4 var5 var6 var7 var8)

out = def (drop = var5) ; ** noduprec not used here**

by var1 var2 var3 var4;

where upcase(compbl(var5)) = "SOME TEXT" and var6 = 9999;

run;

NOTE: There were 43 observations read from the data set ABC.

WHERE (UPCASE(COMPBL(var5))='SOME TEXT') and (var6=9999);

NOTE: The data set WORK.DEF has 43 observations and 7 variables.

NOTE: PROCEDURE SORT used:

real time 0.08 seconds

cpu time 0.08 seconds

proc sort data = def out= ghi **noduprec**;

by var1 var2 var3;

run;

**NOTE: 5 duplicate observations were deleted.**

NOTE: There were 43 observations read from the data set WORK.DEF

NOTE: The data set WORK.GHI has 38 observations and 7 variables.

NOTE: PROCEDURE SORT used:

real time 0.01 seconds

cpu time 0.01 seconds

4 REPLIES 4

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I believe you're not showing us the full picture here. How else would it be possible that you're dropping "VAR1" for output data set "def" but then can sort by "VAR1" on input data set "DEF"?

proc sort data = abc (keep = var1 var2 var3 var4 var5 var6 var7 var8)

out = def (drop = var1) **noduprec**;

by var1 var2 var3 var4;

where upcase(compbl(var5)) = "SOME TEXT" and var6 = 9999;

run;

**NOTE: 0 duplicate observations were deleted.**

NOTE: There were 43 observations read from the data set ABC.

WHERE (UPCASE(COMPBL(var5))='SOME TEXT') and (var6=9999);

NOTE: The data set WORK.DEF has 43 observations and 10 variables.

NOTE: PROCEDURE SORT used:

real time 0.08 seconds

cpu time 0.08 seconds

proc sort data = def out= ghi **noduprec**;

by var1 var2 var3;

run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi Patrick,

You are absolutely correct. Problems with data anonymization!! I replaced the dataset names and variables names from the actual dataset/variables names that I was using in production data. I have now updated my question text....I also realized that my log note had 10 variables whereas the input itself had 8 here

sorry for the trouble

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Without seeing the actual data one can only guess! However, your main problem is likely because you are using NODUPREC. In all of my years of using SAS I have never seen a situation where that would be advisable.

Compare the definitions of NODUPREC and NODUPKEY in the documentation (e.g., Base SAS(R) 9.2 Procedures Guide),

I ALWAYS use NODUPKEY.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Ameet,

This tip is the reason Arthur (and most of us) don't use NODUPREC:

"Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables."

If you have sorted by all variables, you might as well use NODUPKEY.

**SAS Innovate 2025** is scheduled for May 6-9 in Orlando, FL. Sign up to be **first to learn** about the agenda and registration!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Ready to level-up your skills? Choose your own adventure.