BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jwhite
Quartz | Level 8

DIS 4.4

SAS 9.3

New Developer

So, I've got a dataset and I want to 'select distinct' only on a subset of variables. If the observation is distinct, then I want all of it's variables to pass through. For example, I have 30 variables, but I only want to check 5 of them and if those 5 are the same between two (or more) observations, than I only want one of the observations to go through, but the values for all 30 variables.

I would guess I would use an Extract transformation with the option of select distinct set to 'Yes'. However, this isn't doing it. When I only map the 5 variables to the outgoing dataset, then it works, but I only have 5 variables per observation and not the 30 I desire.

Did I make any sense? Basically:

If I have Var_1, Var_2, Var_3, ...Var_30, but want to only select on Var_3, Var_5, Var_6, Var_10, and Var_22. I think doing a 'group by' could also do the trick, but also doesn't seem to:

Select

     Var_1,

     Var_2,

     Var_3,

     ...

     Var_30

from TABLE

group by    

     Var_3,

     Var_5,

     Var_6,

     Var_10,

     Var_22

I still end up with the same number of observations.

Also, is there a way to output a dataset (even a temporary one) of all observations that were not distinct so that I can view that group as well? Say if I have 200 observations, but after selecting distinct, I have 175...is there a node/transformation I can use to view those 25 that were 'rejected'?

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

You have more control with the data step.

Let's assume you have var1 to var5, and var1 to var3 are your "criterion variables".

proc sort data=have;

by var1 var2 var3;

run;

data

  want

  rejected

;

set have;

by var1 var2 var3;

if first.var3

then output want;

else output rejected; *these are the duplicates;

run;

View solution in original post

5 REPLIES 5
TomKari
Onyx | Level 15

Hi, jwhite

"Select distinct" doesn't do what you need; what it does is resolve the SQL request to the result, and then eliminate duplicates from that result, returning one row for every distinct combination of variables in the result. I'm not the world's biggest SQL expert, I suspect somebody else will come to your rescue with the correct code, but at least you know the reason it didn't work!

Tom

Kurt_Bremser
Super User

You have more control with the data step.

Let's assume you have var1 to var5, and var1 to var3 are your "criterion variables".

proc sort data=have;

by var1 var2 var3;

run;

data

  want

  rejected

;

set have;

by var1 var2 var3;

if first.var3

then output want;

else output rejected; *these are the duplicates;

run;

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Would dupout and nodupkey not suffice?:

data have;
  attrib var1-var5 format=$20.;
  infile cards;
  input var1-var5;
cards;
aaa bbb ccc ddd eee fff
aaa bbb ccc ddd eee fff
bbb bbb ccc ddd eee fff
ccc bbb ccc ddd eee fff
;
run;

proc sort data=have out=want dupout=rejected nodupkey;
  by var1 var2 var3;
run;

---

To note, you can also do it in SQL, unlike Kurt's solution you would have two steps, one for WANT, one for REJECTED:

proc sql;

  create table WANT as

  select  distinct(VAR1||VAR2||VAR3),

          *

  from    HAVE;

  create table REJECTED as

  select  distinct(VAR1||VAR2||VAR3),

          *

  from    (select *,

                  COUNT(VAR1||VAR2||VAR3) as CNT

           from   HAVE

           group by VAR1||VAR2||VAR3)

  where   CNT > 1;

quit;

Kurt_Bremser
Super User

Would dupout and nodupkey not suffice?

Yes, of course, but with doing it "manually" you can have additional control which records are preferred to keep (eg first or last in time etc). One can use additional vars in the sort to set the preference.

<Spock> IMO it is easier to convey the developer's intent with the data step method. IMO. </Spock>

Ksharp
Super User

If I understand what you mean . you can do something like :

select distinct catx(' ',Var_3, Var_5, Var_6, Var_10,  Var_22)

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 11515 views
  • 1 like
  • 5 in conversation