Re: Select Distinct from Group

jwhite · Posted 01-05-2017 02:53 PM

If my dataset has 10 columns, but I only want distinct rows selected based on the values of four of those...can it be done? I want all 10 columns in my out put, but again, select distinct only being evaulated on four of those columns.

Can this be done in an extract node in DI Studio as well?

Thanks!

Reeza · Posted 01-05-2017 03:13 PM

Yes, but you do need to specify some rules so you're sure on how the other values are selected. Assuming they aren't identical, how do you want them to be selected?

ie the first four (order doesn't matter) columns are identical, but the next one isn't. Which row will be selected? What are the rules? That will define your logic.

A B C D 10

A B C D 30

A B C D 50

jwhite · Posted 01-05-2017 04:30 PM

Let's say i want the highest value in the fifth column...where do we go from there?

I'm attampting to do this in DI Studio, but would like the proper syntax as well for understanding.

Thanks!

LaurieF · Posted 01-05-2017 05:20 PM

Is this what you want? I'm often wary of using distinct when using group by, because it can distort the results.

data source;
infile cards;
attrib col1-col4 length=$ 1;
attrib col5 length=3;
input col1-col5;
cards;
A B C D 10
A B C D 10
A B C D 30
A B C D 50
;
run;

proc sql;
create table result as
   select col1,
          col2,
          col3,
          col4,
          max(col5) as max_col5
     from source
    group by col1,
             col2,
             col3,
             col4;
quit;

Reeza · Posted 01-05-2017 05:22 PM

I'm not sure if you're using custom SQL or the GUI tools in DI Studio. I suspect the latter, so I'll leave that for someone else to answer since I don't work with DI Studio. For custom SQL, @LaurieF answer is correct, you would use an aggregate function in the SQL to select the fifth and other columns.

LaurieF · Posted 01-05-2017 05:25 PM

Both the extract and SQL queries in DI will allow you to generate code virtually identical to my hand-cut suggestion; note that extract doesn't allow the having clause.

I don't have DI here to show what it would look like, but I can talk you through it if you like. It's best for debugging purposes not to use custom SQL code - let the transformation tool do all the heavy lifting if you can get away with it.

jwhite · Posted 01-05-2017 05:34 PM

@LaurieF I know there is the option to include some custom code in DI Studio transformations. I feel pretty proficient with the basic use of the nodes, so if you've got tips on how to do this via DI Studio Extract node (or other), I'd love to see them.

Thanks!

LaurieF · Posted 01-05-2017 05:42 PM

I'm going to rudely assume that you've not used too much DI and start from scratch (sorry!). I shall also make mistakes because I'm doing it from memory… (Sorry again)

Plonk your source table in the project and drag over the SQL Extract transformation and join them one to another.

Open up the transformation and map the first four columns to the target. If they're there already (I hate auto-mapping with a passion), remove the fifth column, and any of the other columns that may have been put there autoblackmagically.

For the fifth column, make its name in the target area something meaningful. In the expression field, enter

max(col5)

In the group by tab (again I'm doing this from memory), select the first four columns.

The result will only be in a work dataset, of course, but I'll leave all the rest to you.

This should work.

Patrick · Posted 01-05-2017 09:38 PM

@jwhite

If I don't know how to do something in DIS then I normally start with working out how the code logic should look like. That's the SQL @LaurieF already provided.

The functionality of the SQL Extract transformation is a sub-set of the SQL Join transformation. It happened to me more than once that I've used the Extract transformation and then came a change where the required functionality was only available in the SQL Join transformation and I had to re-build everything. For this reason I've stopped using the Extract transformation at all and always use the Join transformation to begin with.

What you want to do is pretty basic. Here some screenshots.

Sample job:

SQL Join Node "Aggregate by name and sex":

LaurieF · Posted 01-06-2017 03:08 AM

Ah - that's the one. Thank you, @Patrick. I should have checked for that sort of example. And if only I could convince SAS Inst to give me a copy of SAS/DI…

I concur over the Extract/Join comparison. I use Extract when I know that the transformation will remain a simple one, but I much prefer the power of the join.

Kritsana · Posted 01-06-2017 03:37 AM

Let try "Sort" node to order 5 keys(#5 key is the descending value) which you mention, then use "Data Validation" node to reject duplicate row by 4 keys (select option: Move all but one row to error table).

hope it can help you. 🙂

Patrick · Posted 01-06-2017 03:53 AM

@Kritsana

What you propose could be done with the Sort node only using NODUPKEY. The disadvantage of this method is that the selected non-key variable values will be "random".

Kritsana · Posted 01-06-2017 04:18 AM

@Patrick
Thank you very much for your advised. My propose is just provided another way to get result, the reason to use Data Validation node because jwhite want to descending column #5 if there're more duplication values in 4 keys.

jwhite · Posted 01-06-2017 08:57 AM

Wow! I used the Join statement, messaged it a bit more to suit my needs, and it worked.

Can you guys expand a little more on using Join instead of Extract (when not actually joining multiple tables)? I hadn't heard of or considered this before.

Thanks!

jwhite · Posted 01-06-2017 04:35 PM

I had some extra time this afternoon and attempted to do the same thing with an Extract node...and I was.

@Patrick, @LaurieF, can you please just elaborate a little more why I should consider using joins more fequently!

Thanks and have a great weekend!

Catch up on SAS Innovate 2026