BookmarkSubscribeRSS Feed
jwhite
Quartz | Level 8

If my dataset has 10 columns, but  I only want distinct rows selected based on the values of four of those...can it be done? I want all 10 columns in my out put, but again, select distinct only being evaulated on four of those columns.

 

Can this be done in an extract node in DI Studio as well?

 

Thanks!

19 REPLIES 19
Reeza
Super User

Yes, but you do need to specify some rules so you're sure on how the other values are selected. Assuming they aren't identical, how do you want them to be selected?

 

ie the first four (order doesn't matter) columns are identical, but the next one isn't. Which row will be selected? What are the rules? That will define your logic.

 

A B C D 10

A B C D 10

A B C D 30

A B C D 50

 

jwhite
Quartz | Level 8

Let's say i want the highest value in the fifth column...where do we go from there? 

 

I'm attampting to do this in DI Studio, but would like the proper syntax as well for understanding.

 

Thanks!

 

LaurieF
Barite | Level 11

Is this what you want? I'm often wary of using distinct when using group by, because it can distort the results.

 

data source;
infile cards;
attrib col1-col4 length=$ 1;
attrib col5 length=3;
input col1-col5;
cards;
A B C D 10
A B C D 10
A B C D 30
A B C D 50
;
run;

proc sql;
create table result as
   select col1,
          col2,
          col3,
          col4,
          max(col5) as max_col5
     from source
    group by col1,
             col2,
             col3,
             col4;
quit;
Reeza
Super User

I'm not sure if you're using custom SQL or the GUI tools in DI Studio. I suspect the latter, so I'll leave that for someone else to answer since I don't work with DI Studio. For custom SQL, @LaurieF answer is correct, you would use an aggregate function in the SQL to select the fifth and other columns. 

LaurieF
Barite | Level 11

Both the extract and SQL queries in DI will allow you to generate code virtually identical to my hand-cut suggestion; note that extract doesn't allow the having clause.

 

I don't have DI here to show what it would look like, but I can talk you through it if you like. It's best for debugging purposes not to use custom SQL code - let the transformation tool do all the heavy lifting if you can get away with it.

jwhite
Quartz | Level 8

@LaurieF  I know there is the option to include some custom code in DI Studio transformations. I feel pretty proficient with the basic use of the nodes, so if you've got tips on how to do this via DI Studio Extract node (or other), I'd love to see them.

 

Thanks!

LaurieF
Barite | Level 11

I'm going to rudely assume that you've not used too much DI and start from scratch (sorry!). I shall also make mistakes because I'm doing it from memory… (Sorry again)

 

Plonk your source table in the project and drag over the SQL Extract transformation and join them one to another.

 

Open up the transformation and map the first four columns to the target. If they're there already (I hate auto-mapping with a passion), remove the fifth column, and any of the other columns that may have been put there autoblackmagically.

 

For the fifth column, make its name in the target area something meaningful. In the expression field, enter 

max(col5)

 

In the group by tab (again I'm doing this from memory), select the first four columns.

 

The result will only be in a work dataset, of course, but I'll leave all the rest to you.

 

This should work. 

Patrick
Opal | Level 21

@jwhite

If I don't know how to do something in DIS then I normally start with working out how the code logic should look like. That's the SQL @LaurieF already provided.

 

The functionality of the SQL Extract transformation is a sub-set of the SQL Join transformation. It happened to me more than once that I've used the Extract transformation and then came a change where the required functionality was only available in the SQL Join transformation and I had to re-build everything. For this reason I've stopped using the Extract transformation at all and always use the Join transformation to begin with.

 

What you want to do is pretty basic. Here some screenshots.

 

Sample job:

Capture.PNG

 

SQL Join Node "Aggregate by name and sex":

Capture.PNG

 

Capture.PNG

 

Capture.PNG

 

 

 

LaurieF
Barite | Level 11

Ah - that's the one. Thank you, @Patrick. I should have checked for that sort of example. And if only I could convince SAS Inst to give me a copy of SAS/DI…

 

I concur over the Extract/Join comparison. I use Extract when I know that the transformation will remain a simple one, but I much prefer the power of the join.

Kritsana
Calcite | Level 5

Let try "Sort" node to order 5 keys(#5 key is the descending value) which you mention, then use "Data Validation" node to reject duplicate row by 4 keys (select option: Move all but one row to error table). 

hope it can help you. 🙂

Patrick
Opal | Level 21

@Kritsana

What you propose could be done with the Sort node only using NODUPKEY. The disadvantage of this method is that the selected non-key variable values will be "random".  

Kritsana
Calcite | Level 5

@Patrick
Thank you very much for your advised. My propose is just provided another way to get result, the reason to use Data Validation node because jwhite want to descending column #5 if there're more duplication values in 4 keys.

jwhite
Quartz | Level 8

Wow! I used the Join statement, messaged it a bit more to suit my needs, and it worked.

 

Can you guys expand a little more on using Join instead of Extract (when not actually joining multiple tables)? I hadn't heard of or considered this before.

 

Thanks!

jwhite
Quartz | Level 8

I had some extra time this afternoon and attempted to do the same thing with an Extract node...and I was.

 

@Patrick, @LaurieF, can you please just elaborate a little more why I should consider using joins more fequently!

 

Thanks and have a great weekend!

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 19 replies
  • 7150 views
  • 1 like
  • 7 in conversation