01-05-2017 02:53 PM - edited 01-05-2017 03:02 PM
If my dataset has 10 columns, but I only want distinct rows selected based on the values of four of those...can it be done? I want all 10 columns in my out put, but again, select distinct only being evaulated on four of those columns.
Can this be done in an extract node in DI Studio as well?
01-05-2017 03:13 PM
Yes, but you do need to specify some rules so you're sure on how the other values are selected. Assuming they aren't identical, how do you want them to be selected?
ie the first four (order doesn't matter) columns are identical, but the next one isn't. Which row will be selected? What are the rules? That will define your logic.
A B C D 10
A B C D 10
A B C D 30
A B C D 50
01-05-2017 04:30 PM
Let's say i want the highest value in the fifth column...where do we go from there?
I'm attampting to do this in DI Studio, but would like the proper syntax as well for understanding.
01-05-2017 05:20 PM
Is this what you want? I'm often wary of using distinct when using group by, because it can distort the results.
data source; infile cards; attrib col1-col4 length=$ 1; attrib col5 length=3; input col1-col5; cards; A B C D 10 A B C D 10 A B C D 30 A B C D 50 ; run; proc sql; create table result as select col1, col2, col3, col4, max(col5) as max_col5 from source group by col1, col2, col3, col4; quit;
01-05-2017 05:22 PM
I'm not sure if you're using custom SQL or the GUI tools in DI Studio. I suspect the latter, so I'll leave that for someone else to answer since I don't work with DI Studio. For custom SQL, @LaurieF answer is correct, you would use an aggregate function in the SQL to select the fifth and other columns.
01-05-2017 05:25 PM
Both the extract and SQL queries in DI will allow you to generate code virtually identical to my hand-cut suggestion; note that extract doesn't allow the having clause.
I don't have DI here to show what it would look like, but I can talk you through it if you like. It's best for debugging purposes not to use custom SQL code - let the transformation tool do all the heavy lifting if you can get away with it.
01-05-2017 05:34 PM
@LaurieF I know there is the option to include some custom code in DI Studio transformations. I feel pretty proficient with the basic use of the nodes, so if you've got tips on how to do this via DI Studio Extract node (or other), I'd love to see them.
01-05-2017 05:42 PM
I'm going to rudely assume that you've not used too much DI and start from scratch (sorry!). I shall also make mistakes because I'm doing it from memory… (Sorry again)
Plonk your source table in the project and drag over the SQL Extract transformation and join them one to another.
Open up the transformation and map the first four columns to the target. If they're there already (I hate auto-mapping with a passion), remove the fifth column, and any of the other columns that may have been put there autoblackmagically.
For the fifth column, make its name in the target area something meaningful. In the expression field, enter
In the group by tab (again I'm doing this from memory), select the first four columns.
The result will only be in a work dataset, of course, but I'll leave all the rest to you.
This should work.
01-05-2017 09:38 PM
If I don't know how to do something in DIS then I normally start with working out how the code logic should look like. That's the SQL @LaurieF already provided.
The functionality of the SQL Extract transformation is a sub-set of the SQL Join transformation. It happened to me more than once that I've used the Extract transformation and then came a change where the required functionality was only available in the SQL Join transformation and I had to re-build everything. For this reason I've stopped using the Extract transformation at all and always use the Join transformation to begin with.
What you want to do is pretty basic. Here some screenshots.
SQL Join Node "Aggregate by name and sex":
01-06-2017 03:08 AM
Ah - that's the one. Thank you, @Patrick. I should have checked for that sort of example. And if only I could convince SAS Inst to give me a copy of SAS/DI…
I concur over the Extract/Join comparison. I use Extract when I know that the transformation will remain a simple one, but I much prefer the power of the join.
01-06-2017 03:37 AM - edited 01-06-2017 04:37 AM
Let try "Sort" node to order 5 keys(#5 key is the descending value) which you mention, then use "Data Validation" node to reject duplicate row by 4 keys (select option: Move all but one row to error table).
hope it can help you.
01-06-2017 03:53 AM
What you propose could be done with the Sort node only using NODUPKEY. The disadvantage of this method is that the selected non-key variable values will be "random".
01-06-2017 04:18 AM - edited 01-06-2017 04:35 AM
Thank you very much for your advised. My propose is just provided another way to get result, the reason to use Data Validation node because jwhite want to descending column #5 if there're more duplication values in 4 keys.
01-06-2017 08:57 AM - edited 01-06-2017 10:49 AM
Wow! I used the Join statement, messaged it a bit more to suit my needs, and it worked.
Can you guys expand a little more on using Join instead of Extract (when not actually joining multiple tables)? I hadn't heard of or considered this before.
01-06-2017 04:35 PM