Hi @Cruise SQL is indeed a great ready-meals and super convenient construct of clauses that provides fantastic utility for summary statistics aside the provision of pushing a query easily to a third party database and many others. In essence this kind of compatibility is pretty much unparalleled considering a SQL programmer minus SAS programmer would just like that use it to its full extent and be over with tasks. Therefore, Yes, your thought process is correct to question why would SQL not be robust should you consider all the advantages mentioned and I suppose I missed out many advantages too.
However, your case is kinda specific that mandates to address certain constraints.
1. Since your sample(assuming a representative sample of your real) seems a clean sorted dataset by ID and in an order presumably the observations within the ID are ordered in sequence as you indicated in your preference in picking the earliest(1st occurrence) in case of a tie.
2. 1 makes a case needing to compute a row_number to identify and choose which one is the earliest. This could be achieved by using MONOTONIC() , which is not documented, some argue it is error prone and at any rate is costly to have an extra pass of the dataset that you noticed in the sub-query.
3. Though seemingly simple and concise, there were 3 passes of the dataset, 1st to compute row_num, 2nd to determine the min(diff) and 3rd to identify and choose min of row_num from min(diff)
4. Also, it is always better to think through a datastep solution the very moment you are certain that your dataset is sorted and you know your data alignment. The GROUP BY clause performs a lot of actions i.e does an internal sort, then follows if there is a summary statistics involved and so forth. One caveat to keep in mind though is, in some operating environments, the SQL optimizer does some magic to beat Datastep solutions with a short cut plan in building an internal algorithm. This topic is complex and not needed to confuse us at this point
5. The Datastep offers better control for the user as it is actually easier and intuitive than SQL. One would argue against me here, but trust me it actually takes longer to get the hang of internal mechanisms of SQL as opposed to clearly defined sequential iteration of datastep execution i.e one by one. In SQL, you can go from row processing to column processing and back to row processing and beyond all in one SELECT CLAUSE. I deem this crazy but I believe I have gotten very thorough with this 🙂
6. So, instead of your data requiring a preliminary row_number partitioned by group, if it had a DATE variable and it it were to be an unsorted dataset and if your operating environment is highly conducive for the SQL optimizer to pick the shortest internal algorithm, my oh my! you might just win.
Hope this helps!
... View more