If my dataset has 10 columns, but I only want distinct rows selected based on the values of four of those...can it be done? I want all 10 columns in my out put, but again, select distinct only being evaulated on four of those columns.
Can this be done in an extract node in DI Studio as well?
Thanks!
Yes, but you do need to specify some rules so you're sure on how the other values are selected. Assuming they aren't identical, how do you want them to be selected?
ie the first four (order doesn't matter) columns are identical, but the next one isn't. Which row will be selected? What are the rules? That will define your logic.
A B C D 10
A B C D 10
A B C D 30
A B C D 50
Let's say i want the highest value in the fifth column...where do we go from there?
I'm attampting to do this in DI Studio, but would like the proper syntax as well for understanding.
Thanks!
Is this what you want? I'm often wary of using distinct when using group by, because it can distort the results.
data source;
infile cards;
attrib col1-col4 length=$ 1;
attrib col5 length=3;
input col1-col5;
cards;
A B C D 10
A B C D 10
A B C D 30
A B C D 50
;
run;
proc sql;
create table result as
select col1,
col2,
col3,
col4,
max(col5) as max_col5
from source
group by col1,
col2,
col3,
col4;
quit;
I'm not sure if you're using custom SQL or the GUI tools in DI Studio. I suspect the latter, so I'll leave that for someone else to answer since I don't work with DI Studio. For custom SQL, @LaurieF answer is correct, you would use an aggregate function in the SQL to select the fifth and other columns.
Both the extract and SQL queries in DI will allow you to generate code virtually identical to my hand-cut suggestion; note that extract doesn't allow the having clause.
I don't have DI here to show what it would look like, but I can talk you through it if you like. It's best for debugging purposes not to use custom SQL code - let the transformation tool do all the heavy lifting if you can get away with it.
@LaurieF I know there is the option to include some custom code in DI Studio transformations. I feel pretty proficient with the basic use of the nodes, so if you've got tips on how to do this via DI Studio Extract node (or other), I'd love to see them.
Thanks!
I'm going to rudely assume that you've not used too much DI and start from scratch (sorry!). I shall also make mistakes because I'm doing it from memory… (Sorry again)
Plonk your source table in the project and drag over the SQL Extract transformation and join them one to another.
Open up the transformation and map the first four columns to the target. If they're there already (I hate auto-mapping with a passion), remove the fifth column, and any of the other columns that may have been put there autoblackmagically.
For the fifth column, make its name in the target area something meaningful. In the expression field, enter
max(col5)
In the group by tab (again I'm doing this from memory), select the first four columns.
The result will only be in a work dataset, of course, but I'll leave all the rest to you.
This should work.
If I don't know how to do something in DIS then I normally start with working out how the code logic should look like. That's the SQL @LaurieF already provided.
The functionality of the SQL Extract transformation is a sub-set of the SQL Join transformation. It happened to me more than once that I've used the Extract transformation and then came a change where the required functionality was only available in the SQL Join transformation and I had to re-build everything. For this reason I've stopped using the Extract transformation at all and always use the Join transformation to begin with.
What you want to do is pretty basic. Here some screenshots.
Sample job:
SQL Join Node "Aggregate by name and sex":
Ah - that's the one. Thank you, @Patrick. I should have checked for that sort of example. And if only I could convince SAS Inst to give me a copy of SAS/DI…
I concur over the Extract/Join comparison. I use Extract when I know that the transformation will remain a simple one, but I much prefer the power of the join.
Let try "Sort" node to order 5 keys(#5 key is the descending value) which you mention, then use "Data Validation" node to reject duplicate row by 4 keys (select option: Move all but one row to error table).
hope it can help you. 🙂
What you propose could be done with the Sort node only using NODUPKEY. The disadvantage of this method is that the selected non-key variable values will be "random".
@Patrick
Thank you very much for your advised. My propose is just provided another way to get result, the reason to use Data Validation node because jwhite want to descending column #5 if there're more duplication values in 4 keys.
Wow! I used the Join statement, messaged it a bit more to suit my needs, and it worked.
Can you guys expand a little more on using Join instead of Extract (when not actually joining multiple tables)? I hadn't heard of or considered this before.
Thanks!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.