Solved: proc sql and data step for testing duplications are giving differet nu...

mamin088 · Posted 06-04-2019 01:33 PM

Hi everybody, I'm trying to find the duplications in a data set using two different codes. But I get different #observations! Would somebody explain it to me why?! here are the codes:

proc sql; create table SINGLE as select unique VARIABLE from DATASET; quit; run;

or

proc sort data=DATASET; by VARIABLE; run;

data dup nodup blank;

set DATASET;

by VARIABLE;

if first.VARIABLE and last.VARIABLE then output nodup;

else if VARIABLE="" then output blank;

else output dup;

run;

but I get different # of observations for table SINGLE and data dup.

Thank you!

Tom · Posted 06-04-2019 02:11 PM

The DISTINCT is giving you distinct OBSERVATIONS. But since your select statement only has one variable the effect is distinct VALUES of that variable. No matter how many times that value appears the output includes it only once.

In your data step FIRST.VARIABLE and LAST.VARIABLE will be true for values of VARIABLE that appear only ONCE in the data. The values that appear multiple times will be written to one of the other datasets.

Consider this example:

data have;
  input variable ;
cards;
1
1
2
3
3
4
;

The distinct set of values is 1,2,3,4. The values that appear only once are 2 and 4.

View solution in original post

Tom · Posted 06-04-2019 01:40 PM

Your two codes are doing totally different things.

To replicate what your SQL code is doing just use FIRST.VARIABLE.

data dup nodup blank single2(keep=variable);
  set DATASET;
  by VARIABLE;

  if first.VARIABLE then output single2;

  if first.VARIABLE and last.VARIABLE then output nodup;
  else if VARIABLE="" then output blank;
  else output dup;

run;

mamin088 · Posted 06-04-2019 01:51 PM

Thank you! I did that and you are right the
if first.VARIABLE then output single2; replicates the SQL code but why? I don't understand the difference? SQL is supposed to give me the distinct observations.

Tom · Posted 06-04-2019 02:11 PM

The DISTINCT is giving you distinct OBSERVATIONS. But since your select statement only has one variable the effect is distinct VALUES of that variable. No matter how many times that value appears the output includes it only once.

In your data step FIRST.VARIABLE and LAST.VARIABLE will be true for values of VARIABLE that appear only ONCE in the data. The values that appear multiple times will be written to one of the other datasets.

Consider this example:

data have;
  input variable ;
cards;
1
1
2
3
3
4
;

The distinct set of values is 1,2,3,4. The values that appear only once are 2 and 4.

mamin088 · Posted 06-04-2019 02:16 PM

Got it! Thank you so much for your time!

VDD · Posted 06-04-2019 01:41 PM

that is because this code

if first.VARIABLE and last.VARIABLE then output nodup;
else if VARIABLE="" then output blank;
else output dup;

will not be the same as this code

unique VARIABLE

why are you using unique rather that distinct ?

mamin088 · Posted 06-04-2019 01:52 PM

Thank you VDD! I just searched and found that sql unique is an old syntax and is not an standard one anymore. Better to use the distinct. Thank you again!

proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Re: proc sql and data step for testing duplications are giving differet numbers

Classroom Training Available!