Hi everybody, I'm trying to find the duplications in a data set using two different codes. But I get different #observations! Would somebody explain it to me why?! here are the codes:
proc sql; create table SINGLE as select unique VARIABLE from DATASET; quit; run;
or
proc sort data=DATASET; by VARIABLE; run;
data dup nodup blank;
set DATASET;
by VARIABLE;
if first.VARIABLE and last.VARIABLE then output nodup;
else if VARIABLE="" then output blank;
else output dup;
run;
but I get different # of observations for table SINGLE and data dup.
Thank you!
The DISTINCT is giving you distinct OBSERVATIONS. But since your select statement only has one variable the effect is distinct VALUES of that variable. No matter how many times that value appears the output includes it only once.
In your data step FIRST.VARIABLE and LAST.VARIABLE will be true for values of VARIABLE that appear only ONCE in the data. The values that appear multiple times will be written to one of the other datasets.
Consider this example:
data have;
input variable ;
cards;
1
1
2
3
3
4
;
The distinct set of values is 1,2,3,4. The values that appear only once are 2 and 4.
Your two codes are doing totally different things.
To replicate what your SQL code is doing just use FIRST.VARIABLE.
data dup nodup blank single2(keep=variable);
set DATASET;
by VARIABLE;
if first.VARIABLE then output single2;
if first.VARIABLE and last.VARIABLE then output nodup;
else if VARIABLE="" then output blank;
else output dup;
run;
The DISTINCT is giving you distinct OBSERVATIONS. But since your select statement only has one variable the effect is distinct VALUES of that variable. No matter how many times that value appears the output includes it only once.
In your data step FIRST.VARIABLE and LAST.VARIABLE will be true for values of VARIABLE that appear only ONCE in the data. The values that appear multiple times will be written to one of the other datasets.
Consider this example:
data have;
input variable ;
cards;
1
1
2
3
3
4
;
The distinct set of values is 1,2,3,4. The values that appear only once are 2 and 4.
that is because this code
if first.VARIABLE and last.VARIABLE then output nodup;
else if VARIABLE="" then output blank;
else output dup;
will not be the same as this code
unique VARIABLE
why are you using unique rather that distinct ?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.