Hi folks:
I'm trying to select the row with the min(diff) and if rows ties on diff like it occurs for the ID=3 patient then select the first row.
How to achieve it? my code below doesn't help when rows tied on the 'diff'.
Thanks in advance
data temp;
input id diff;
cards;
1 2
1 3
2 1
2 2
3 2
3 2
;
proc sql;
select *
from temp
group by CTC_id
having DIFF eq min(DIFF)
;
quit;
And the Datastep:
data want;
do _n_=1 by 1 until(last.id);
set temp;
by id;
if min>. and diff>=min then continue;
min=diff;
k=_n_;
end;
do _n_=1 to _n_;
set temp;
if _n_=k then output;
end;
drop k min;
run;
Use a data step and FIRST with a BY statement instead?
Or two proc sorts.
proc sort data=temp;
by id diff;
run;
data want1;
set temp;
by id;
if first.id;
run;
proc sort data=temp;
by id diff;
run;
proc sort data=temp out=want2 NODUPKEY;
by ID;
run;
I am gonna jump to ameeting, but wait for a datastep from somebody else as sql is not robust but until then some fun
proc sql;
create table want(drop=m) as
select *
from
(select *
from (select *,monotonic() as m from temp)
group by id
having DIFF eq min(DIFF))
group by id,diff
having min(m)=m;
;
quit;
Too tired and about to doze off. Will explain tomorrow without fail
Hi @Cruise SQL is indeed a great ready-meals and super convenient construct of clauses that provides fantastic utility for summary statistics aside the provision of pushing a query easily to a third party database and many others. In essence this kind of compatibility is pretty much unparalleled considering a SQL programmer minus SAS programmer would just like that use it to its full extent and be over with tasks. Therefore, Yes, your thought process is correct to question why would SQL not be robust should you consider all the advantages mentioned and I suppose I missed out many advantages too.
However, your case is kinda specific that mandates to address certain constraints.
1. Since your sample(assuming a representative sample of your real) seems a clean sorted dataset by ID and in an order presumably the observations within the ID are ordered in sequence as you indicated in your preference in picking the earliest(1st occurrence) in case of a tie.
2. 1 makes a case needing to compute a row_number to identify and choose which one is the earliest. This could be achieved by using MONOTONIC() , which is not documented, some argue it is error prone and at any rate is costly to have an extra pass of the dataset that you noticed in the sub-query.
3. Though seemingly simple and concise, there were 3 passes of the dataset, 1st to compute row_num, 2nd to determine the min(diff) and 3rd to identify and choose min of row_num from min(diff)
4. Also, it is always better to think through a datastep solution the very moment you are certain that your dataset is sorted and you know your data alignment. The GROUP BY clause performs a lot of actions i.e does an internal sort, then follows if there is a summary statistics involved and so forth. One caveat to keep in mind though is, in some operating environments, the SQL optimizer does some magic to beat Datastep solutions with a short cut plan in building an internal algorithm. This topic is complex and not needed to confuse us at this point
5. The Datastep offers better control for the user as it is actually easier and intuitive than SQL. One would argue against me here, but trust me it actually takes longer to get the hang of internal mechanisms of SQL as opposed to clearly defined sequential iteration of datastep execution i.e one by one. In SQL, you can go from row processing to column processing and back to row processing and beyond all in one SELECT CLAUSE. I deem this crazy but I believe I have gotten very thorough with this 🙂
6. So, instead of your data requiring a preliminary row_number partitioned by group, if it had a DATE variable and it it were to be an unsorted dataset and if your operating environment is highly conducive for the SQL optimizer to pick the shortest internal algorithm, my oh my! you might just win.
Hope this helps!
And the Datastep:
data want;
do _n_=1 by 1 until(last.id);
set temp;
by id;
if min>. and diff>=min then continue;
min=diff;
k=_n_;
end;
do _n_=1 to _n_;
set temp;
if _n_=k then output;
end;
drop k min;
run;
@novinosrin , perfect use case of double-DoW-loop.
All the best
Bart
@yabwon Dziękuję bardzo and the same to you in my native Tamil மிக்க நன்றி. BTW, if you ever meet somebody that "somebody" by the name Mr. Marcin from Krakow(I believe) asked me to learn something called SAS in catholic church, Wolverhampton over a glass of red wine, say my heartfelt thanks. I can't remember him nor I know him but he recommended this pubaholic forklift driver with no education whatsoever to learn SAS believing that I would do well in that one meeting instance. Haha :), seems partially true. God Bless!
Wow! How cool it is that your "SAS roots" are Polish! 😉 If we ever have chance to meet face-2-face I have to hear that story from you in more details (maybe over a glass of wine too, maybe at SGF2020?)
All the best
Bart
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.