SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Select Distinct from Group

Reply
Frequent Contributor
Posts: 88

Select Distinct from Group

[ Edited ]

If my dataset has 10 columns, but  I only want distinct rows selected based on the values of four of those...can it be done? I want all 10 columns in my out put, but again, select distinct only being evaulated on four of those columns.

 

Can this be done in an extract node in DI Studio as well?

 

Thanks!

Super User
Posts: 17,818

Re: Select Distinct from Group

Yes, but you do need to specify some rules so you're sure on how the other values are selected. Assuming they aren't identical, how do you want them to be selected?

 

ie the first four (order doesn't matter) columns are identical, but the next one isn't. Which row will be selected? What are the rules? That will define your logic.

 

A B C D 10

A B C D 10

A B C D 30

A B C D 50

 

Frequent Contributor
Posts: 88

Re: Select Distinct from Group

Let's say i want the highest value in the fifth column...where do we go from there? 

 

I'm attampting to do this in DI Studio, but would like the proper syntax as well for understanding.

 

Thanks!

 

Super Contributor
Posts: 251

Re: Select Distinct from Group

Is this what you want? I'm often wary of using distinct when using group by, because it can distort the results.

 

data source;
infile cards;
attrib col1-col4 length=$ 1;
attrib col5 length=3;
input col1-col5;
cards;
A B C D 10
A B C D 10
A B C D 30
A B C D 50
;
run;

proc sql;
create table result as
   select col1,
          col2,
          col3,
          col4,
          max(col5) as max_col5
     from source
    group by col1,
             col2,
             col3,
             col4;
quit;
Super User
Posts: 17,818

Re: Select Distinct from Group

I'm not sure if you're using custom SQL or the GUI tools in DI Studio. I suspect the latter, so I'll leave that for someone else to answer since I don't work with DI Studio. For custom SQL, @LaurieF answer is correct, you would use an aggregate function in the SQL to select the fifth and other columns. 

Super Contributor
Posts: 251

Re: Select Distinct from Group

Both the extract and SQL queries in DI will allow you to generate code virtually identical to my hand-cut suggestion; note that extract doesn't allow the having clause.

 

I don't have DI here to show what it would look like, but I can talk you through it if you like. It's best for debugging purposes not to use custom SQL code - let the transformation tool do all the heavy lifting if you can get away with it.

Frequent Contributor
Posts: 88

Re: Select Distinct from Group

@LaurieF  I know there is the option to include some custom code in DI Studio transformations. I feel pretty proficient with the basic use of the nodes, so if you've got tips on how to do this via DI Studio Extract node (or other), I'd love to see them.

 

Thanks!

Super Contributor
Posts: 251

Re: Select Distinct from Group

I'm going to rudely assume that you've not used too much DI and start from scratch (sorry!). I shall also make mistakes because I'm doing it from memory… (Sorry again)

 

Plonk your source table in the project and drag over the SQL Extract transformation and join them one to another.

 

Open up the transformation and map the first four columns to the target. If they're there already (I hate auto-mapping with a passion), remove the fifth column, and any of the other columns that may have been put there autoblackmagically.

 

For the fifth column, make its name in the target area something meaningful. In the expression field, enter 

max(col5)

 

In the group by tab (again I'm doing this from memory), select the first four columns.

 

The result will only be in a work dataset, of course, but I'll leave all the rest to you.

 

This should work. 

Respected Advisor
Posts: 3,887

Re: Select Distinct from Group

@jwhite

If I don't know how to do something in DIS then I normally start with working out how the code logic should look like. That's the SQL @LaurieF already provided.

 

The functionality of the SQL Extract transformation is a sub-set of the SQL Join transformation. It happened to me more than once that I've used the Extract transformation and then came a change where the required functionality was only available in the SQL Join transformation and I had to re-build everything. For this reason I've stopped using the Extract transformation at all and always use the Join transformation to begin with.

 

What you want to do is pretty basic. Here some screenshots.

 

Sample job:

Capture.PNG

 

SQL Join Node "Aggregate by name and sex":

Capture.PNG

 

Capture.PNG

 

Capture.PNG

 

 

 

Super Contributor
Posts: 251

Re: Select Distinct from Group

Ah - that's the one. Thank you, @Patrick. I should have checked for that sort of example. And if only I could convince SAS Inst to give me a copy of SAS/DI…

 

I concur over the Extract/Join comparison. I use Extract when I know that the transformation will remain a simple one, but I much prefer the power of the join.

New Contributor
Posts: 2

Re: Select Distinct from Group

[ Edited ]

Let try "Sort" node to order 5 keys(#5 key is the descending value) which you mention, then use "Data Validation" node to reject duplicate row by 4 keys (select option: Move all but one row to error table). 

hope it can help you. Smiley Happy

Respected Advisor
Posts: 3,887

Re: Select Distinct from Group

@Kritsana

What you propose could be done with the Sort node only using NODUPKEY. The disadvantage of this method is that the selected non-key variable values will be "random".  

New Contributor
Posts: 2

Re: Select Distinct from Group

[ Edited ]

@Patrick
Thank you very much for your advised. My propose is just provided another way to get result, the reason to use Data Validation node because jwhite want to descending column #5 if there're more duplication values in 4 keys.

Frequent Contributor
Posts: 88

Re: Select Distinct from Group

[ Edited ]

Wow! I used the Join statement, messaged it a bit more to suit my needs, and it worked.

 

Can you guys expand a little more on using Join instead of Extract (when not actually joining multiple tables)? I hadn't heard of or considered this before.

 

Thanks!

Frequent Contributor
Posts: 88

Re: Select Distinct from Group

I had some extra time this afternoon and attempted to do the same thing with an Extract node...and I was.

 

@Patrick, @LaurieF, can you please just elaborate a little more why I should consider using joins more fequently!

 

Thanks and have a great weekend!

 

 

Ask a Question
Discussion stats
  • 19 replies
  • 900 views
  • 1 like
  • 7 in conversation