Thanks for the kind words.
WRT the essence of the task, the principal issue is the programming algorithm. It goes more or less like this:
As you can see, which kind of lookup table to choose is a matter of the programmer's discretion and/or familiarity with different SAS tools. Arrays items are naturally accessed by index (in this case, _i) and easy to update directly in place. Alternatively, a hash table can be used, whose advantage is that, unlike with an array, you needn't muck around determining its maximum size beforehand since it grows automatically when the next item is added. The disadvantage is that you cannot update a data portion variable directly in the table: Instead, you need to first extract its value into its PDV host variable (by calling the FIND method), set it to the desired value, and then update the table (by calling the REPLACE method). Or, in the SAS language:
data want (keep = sg_a key_a key_b balance_extra actual_transfer) ;
if _n_ = 1 then do ;
dcl hash h () ;
h.definekey ("_i") ;
h.definedata ("extra", "key_a") ;
h.definedone () ;
end ;
do _i = 1 by 1 until (last.sg_a) ;
set table_a (in = _a) table_b (rename = sg_b = sg_a);
by sg_a ;
if _a then h.add() ;
else do _i = 1 by 0 until (short = 0 or _i > h.num_items) ;
h.find() ;
actual_transfer = extra min short ;
if actual_transfer > 0 then do ;
balance_extra = extra - actual_transfer ;
output ;
end ;
extra +- actual_transfer ;
short +- actual_transfer ;
h.replace() ;
if extra = 0 then _i + 1 ;
end ;
end ;
h.clear() ;
run ;
Note that with the array solution, we needn't clear the array content before each next BY group because when it is loaded from it, its _iM first items get overwritten, and the further balance transfer processing occurs only up to _iM.
Kind regards
Paul M.
Thank for another tips - Data step debugger as this is first time I hear this, just have a googling on this and this is way better visual than to remove the noprint and cancel those drop to understand the flow.
Deepest apologies for late reply, when I test the first solution(array), you have already share the 2nd solution(hash method) and even further explain the logical idea of solving this puzzle. The array was work perfectly, I will test the 2nd solution tomorrow and will compare the time consume for both solution. Then I will share the time spend for each solution and to confirm the best solution(actually not sure if i doing the fair comparison, thinking of using the same real data and just compare the time spend).
Best Regards
My deepest apologies for delay as I couldn't make it last Friday before I have my long leave. Below is my tested results,
1486 observations read from the data set WORK.TABLE_A.
1007 observations read from the data set WORK.TABLE_B.
Total of 2493 observations only.
Hash Method:
NOTE: There were 1486 observations read from the data set WORK.TABLE_A.
NOTE: There were 1007 observations read from the data set WORK.TABLE_B.
NOTE: The data set WORK.WANT has 1085 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
Array Method:
NOTE: There were 1486 observations read from the data set WORK.TABLE_A.
NOTE: There were 1007 observations read from the data set WORK.TABLE_B.
NOTE: The data set WORK.WANT has 1085 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
From the above time, I can say the array method is faster than Hash method. However, i wish i can choose both as solution because both also produce the same observation numbers(sorry didn't have time to check all),
Hope this will benefit others also, I'm also welcome others expert to provide other solution and comments as well.
Thank again for both your time.
As noted earlier, I'd expect the array to be faster than the hash - in this particular situation and for the reasons already stated.
However, truth be told, a real speed comparison test should involve more than a thousand or so records. When both methods being compared run under 0.03 seconds, it's difficult to attribute the difference between the run times shown in the log to the methods themselves rather than to a myriad of always existing confounding factors (such as other processes running on the same machine concurrently with SAS).
Kind regards
Paul D.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.