BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mamin088
Fluorite | Level 6

Hi everybody, I'm trying to find the duplications in a data set using two different codes. But I get different #observations! Would somebody explain it to me why?! here are the codes:

 

proc sql; create table SINGLE  as select unique VARIABLE from  DATASET; quit; run;

 

or

 

proc sort data=DATASET; by VARIABLE; run;

data dup nodup blank;

set DATASET;

by VARIABLE;

if first.VARIABLE and last.VARIABLE then output nodup;

else if VARIABLE="" then output blank;

else output dup;

run;

 

but I get different # of observations for table SINGLE and data dup.

 

Thank you!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

The DISTINCT is giving you distinct OBSERVATIONS.  But since your select statement only has one variable the effect is distinct VALUES of that variable. No matter how many times that value appears the output includes it only once. 

 

In your data step FIRST.VARIABLE and LAST.VARIABLE will be true for values of VARIABLE that appear only ONCE in the data.  The values that appear multiple times will be written to one of the other datasets.

 

Consider this example:

data have;
  input variable ;
cards;
1
1
2
3
3
4
;

The distinct set of values is 1,2,3,4.  The values that appear only once are 2 and 4.

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

Your two codes are doing totally different things.

To replicate what your SQL code is doing just use FIRST.VARIABLE.

data dup nodup blank single2(keep=variable);
  set DATASET;
  by VARIABLE;

  if first.VARIABLE then output single2;

  if first.VARIABLE and last.VARIABLE then output nodup;
  else if VARIABLE="" then output blank;
  else output dup;

run;
mamin088
Fluorite | Level 6
Thank you! I did that and you are right the
if first.VARIABLE then output single2; replicates the SQL code but why? I don't understand the difference? SQL is supposed to give me the distinct observations.

Tom
Super User Tom
Super User

The DISTINCT is giving you distinct OBSERVATIONS.  But since your select statement only has one variable the effect is distinct VALUES of that variable. No matter how many times that value appears the output includes it only once. 

 

In your data step FIRST.VARIABLE and LAST.VARIABLE will be true for values of VARIABLE that appear only ONCE in the data.  The values that appear multiple times will be written to one of the other datasets.

 

Consider this example:

data have;
  input variable ;
cards;
1
1
2
3
3
4
;

The distinct set of values is 1,2,3,4.  The values that appear only once are 2 and 4.

mamin088
Fluorite | Level 6
Got it! Thank you so much for your time!
VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

that is because this code 

if first.VARIABLE and last.VARIABLE then output nodup;
else if VARIABLE="" then output blank;
else output dup;

will not be the same as this code

unique VARIABLE

why are you using unique rather that distinct ?

 

 

 

 

 

mamin088
Fluorite | Level 6
Thank you VDD! I just searched and found that sql unique is an old syntax and is not an standard one anymore. Better to use the distinct. Thank you again!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 669 views
  • 3 likes
  • 3 in conversation