SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Distinct on Select Columns

Reply
Frequent Contributor
Posts: 89

Distinct on Select Columns

DIS 4.4

SAS 9.3

New Developer

So, I've got a dataset and I want to 'select distinct' only on a subset of variables. If the observation is distinct, then I want all of it's variables to pass through. For example, I have 30 variables, but I only want to check 5 of them and if those 5 are the same between two (or more) observations, than I only want one of the observations to go through, but the values for all 30 variables.

I would guess I would use an Extract transformation with the option of select distinct set to 'Yes'. However, this isn't doing it. When I only map the 5 variables to the outgoing dataset, then it works, but I only have 5 variables per observation and not the 30 I desire.

Did I make any sense? Basically:

If I have Var_1, Var_2, Var_3, ...Var_30, but want to only select on Var_3, Var_5, Var_6, Var_10, and Var_22. I think doing a 'group by' could also do the trick, but also doesn't seem to:

Select

     Var_1,

     Var_2,

     Var_3,

     ...

     Var_30

from TABLE

group by    

     Var_3,

     Var_5,

     Var_6,

     Var_10,

     Var_22

I still end up with the same number of observations.

Also, is there a way to output a dataset (even a temporary one) of all observations that were not distinct so that I can view that group as well? Say if I have 200 observations, but after selecting distinct, I have 175...is there a node/transformation I can use to view those 25 that were 'rejected'?

Thanks!

PROC Star
Posts: 1,167

Re: Distinct on Select Columns

Hi, jwhite

"Select distinct" doesn't do what you need; what it does is resolve the SQL request to the result, and then eliminate duplicates from that result, returning one row for every distinct combination of variables in the result. I'm not the world's biggest SQL expert, I suspect somebody else will come to your rescue with the correct code, but at least you know the reason it didn't work!

Tom

Super User
Posts: 7,782

Re: Distinct on Select Columns

You have more control with the data step.

Let's assume you have var1 to var5, and var1 to var3 are your "criterion variables".

proc sort data=have;

by var1 var2 var3;

run;

data

  want

  rejected

;

set have;

by var1 var2 var3;

if first.var3

then output want;

else output rejected; *these are the duplicates;

run;

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,955

Re: Distinct on Select Columns

Posted in reply to KurtBremser

Would dupout and nodupkey not suffice?:

data have;
  attrib var1-var5 format=$20.;
  infile cards;
  input var1-var5;
cards;
aaa bbb ccc ddd eee fff
aaa bbb ccc ddd eee fff
bbb bbb ccc ddd eee fff
ccc bbb ccc ddd eee fff
;
run;

proc sort data=have out=want dupout=rejected nodupkey;
  by var1 var2 var3;
run;

---

To note, you can also do it in SQL, unlike Kurt's solution you would have two steps, one for WANT, one for REJECTED:

proc sql;

  create table WANT as

  select  distinct(VAR1||VAR2||VAR3),

          *

  from    HAVE;

  create table REJECTED as

  select  distinct(VAR1||VAR2||VAR3),

          *

  from    (select *,

                  COUNT(VAR1||VAR2||VAR3) as CNT

           from   HAVE

           group by VAR1||VAR2||VAR3)

  where   CNT > 1;

quit;

Super User
Posts: 7,782

Re: Distinct on Select Columns

Would dupout and nodupkey not suffice?

Yes, of course, but with doing it "manually" you can have additional control which records are preferred to keep (eg first or last in time etc). One can use additional vars in the sort to set the preference.

<Spock> IMO it is easier to convey the developer's intent with the data step method. IMO. </Spock>

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 10,028

Re: Distinct on Select Columns

If I understand what you mean . you can do something like :

select distinct catx(' ',Var_3, Var_5, Var_6, Var_10,  Var_22)

Ask a Question
Discussion stats
  • 5 replies
  • 3583 views
  • 1 like
  • 5 in conversation