topic Re: Distinct on Select Columns in SAS Data Management

Distinct on Select Columns

jwhite — Tue, 27 May 2014 19:54:35 GMT

DIS 4.4

SAS 9.3

New Developer

So, I've got a dataset and I want to 'select distinct' only on a subset of variables. If the observation is distinct, then I want all of it's variables to pass through. For example, I have 30 variables, but I only want to check 5 of them and if those 5 are the same between two (or more) observations, than I only want one of the observations to go through, but the values for all 30 variables.

I would guess I would use an Extract transformation with the option of select distinct set to 'Yes'. However, this isn't doing it. When I only map the 5 variables to the outgoing dataset, then it works, but I only have 5 variables per observation and not the 30 I desire.

Did I make any sense? Basically:

If I have Var_1, Var_2, Var_3, ...Var_30, but want to only select on Var_3, Var_5, Var_6, Var_10, and Var_22. I think doing a 'group by' could also do the trick, but also doesn't seem to:

Select

Var_1,

Var_2,

Var_3,

...

Var_30

from TABLE

group by

Var_3,

Var_5,

Var_6,

Var_10,

Var_22

I still end up with the same number of observations.

Also, is there a way to output a dataset (even a temporary one) of all observations that were not distinct so that I can view that group as well? Say if I have 200 observations, but after selecting distinct, I have 175...is there a node/transformation I can use to view those 25 that were 'rejected'?

Thanks!

Re: Distinct on Select Columns

TomKari — Tue, 27 May 2014 21:34:21 GMT

Hi, jwhite

"Select distinct" doesn't do what you need; what it does is resolve the SQL request to the result, and then eliminate duplicates from that result, returning one row for every distinct combination of variables in the result. I'm not the world's biggest SQL expert, I suspect somebody else will come to your rescue with the correct code, but at least you know the reason it didn't work!

Tom

Re: Distinct on Select Columns

Kurt_Bremser — Wed, 28 May 2014 07:21:08 GMT

You have more control with the data step.

Let's assume you have var1 to var5, and var1 to var3 are your "criterion variables".

proc sort data=have;

by var1 var2 var3;

run;

data

want

rejected

;

set have;

by var1 var2 var3;

if first.var3

then output want;

else output rejected; *these are the duplicates;

run;

Re: Distinct on Select Columns

RW9 — Wed, 28 May 2014 08:32:38 GMT

Would dupout and nodupkey not suffice?:

data have;
attrib var1-var5 format=$20.;
infile cards;
input var1-var5;
cards;
aaa bbb ccc ddd eee fff
aaa bbb ccc ddd eee fff
bbb bbb ccc ddd eee fff
ccc bbb ccc ddd eee fff
;
run;

proc sort data=have out=want dupout=rejected nodupkey;
by var1 var2 var3;
run;

---

To note, you can also do it in SQL, unlike Kurt's solution you would have two steps, one for WANT, one for REJECTED:

proc sql;

create table WANT as

select distinct(VAR1||VAR2||VAR3),

from HAVE;

create table REJECTED as

select distinct(VAR1||VAR2||VAR3),

from (select *,

COUNT(VAR1||VAR2||VAR3) as CNT

from HAVE

group by VAR1||VAR2||VAR3)

where CNT > 1;

quit;

Re: Distinct on Select Columns

Kurt_Bremser — Wed, 28 May 2014 11:41:18 GMT

Would dupout and nodupkey not suffice?

Yes, of course, but with doing it "manually" you can have additional control which records are preferred to keep (eg first or last in time etc). One can use additional vars in the sort to set the preference.

<Spock> IMO it is easier to convey the developer's intent with the data step method. IMO. </Spock>

Re: Distinct on Select Columns

Ksharp — Wed, 28 May 2014 11:57:49 GMT

If I understand what you mean . you can do something like :

select distinct catx(' ',Var_3, Var_5, Var_6, Var_10, Var_22)