hashman
Ammonite | Level 13

@sagulolo :

Thanks for the kind words.

 

WRT the essence of the task, the principal issue is the programming algorithm. It goes more or less like this:

  1. Read the next (i.e., initially the first) BY group from Table_A one record at a time, assigning each record the next sequential number _i (1, 2, 3, ...). If no records are left to read, STOP.
  2. Store EXTRA and KEY_A from each record in a lookup table (of some sort) in memory whose items can be (i) accessed by item number as a key and (ii) updated using the item number as a key. This key is simply the sequential number _i.
  3. Read the next (i.e. at the beginning, the first) BY group from Table_B. If no records in the file are left to read, STOP. For each record from this BY group:
  4. Set _i to 1 and extract the item with _i=1 from the lookup table.
  5. Compute actual_transfer as the minimum between SHORT and the value of EXTRA extracted from the table.
  6. If actual_transfer > 0 then compute balance_extra = extra - actual_transfer, then output.
  7. Subtract actual_transfer from EXTRA, then subtract actual_transfer from SHORT.
  8. Update the table for the current value of _i with the new value of EXTRA, keeping KEY_A in the table intact.
  9. If EXTRA=0, then we need to grab the next item from the table, so increment _i by 1. Then go to 5.
  10. After the last record from Table_B has been processed, clear the lookup table (unless measures are taken to overwrite its items).
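For readers following along outside SAS, the steps above can be traced in any language; below is a minimal Python simulation of the same algorithm for a single BY group. All names and values here are made up for illustration and are not from the original tables:

```python
# Hypothetical sample data for one BY group:
# each Table_A record supplies (key_a, extra); each Table_B record (key_b, short).
table_a = [("A1", 5), ("A2", 3), ("A3", 4)]
table_b = [("B1", 6), ("B2", 4)]

# Steps 1-2: load the BY group into a lookup table keyed by sequence number _i.
lookup = {i: [extra, key_a] for i, (key_a, extra) in enumerate(table_a, start=1)}

out = []
# Step 3: process each Table_B record of the BY group.
for key_b, short in table_b:
    _i = 1                                   # step 4
    while short > 0 and _i <= len(lookup):
        extra, key_a = lookup[_i]
        actual_transfer = min(extra, short)  # step 5
        if actual_transfer > 0:              # step 6
            balance_extra = extra - actual_transfer
            out.append((key_a, key_b, actual_transfer, balance_extra))
        extra -= actual_transfer             # step 7
        short -= actual_transfer
        lookup[_i][0] = extra                # step 8: update the table in place
        if extra == 0:                       # step 9: this donor is exhausted
            _i += 1

lookup.clear()                               # step 10
```

Running this yields one output row per actual transfer, with partially drained EXTRA values carried over between Table_B records exactly as the step list prescribes.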

 

As you can see, which kind of lookup table to choose is a matter of the programmer's discretion and/or familiarity with different SAS tools. Array items are naturally accessed by index (in this case, _i) and easy to update directly in place. Alternatively, a hash table can be used, whose advantage is that, unlike with an array, you needn't muck around determining its maximum size beforehand, since it grows automatically when the next item is added. The disadvantage is that you cannot update a data portion variable directly in the table: instead, you need to first extract its value into its PDV host variable (by calling the FIND method), set it to the desired value, and then update the table (by calling the REPLACE method). Or, in the SAS language:

data want (keep = sg_a key_a key_b balance_extra actual_transfer) ;
  if _n_ = 1 then do ;
    /* lookup table keyed by the sequential item number _i */
    dcl hash h () ;
    h.definekey ("_i") ;
    h.definedata ("extra", "key_a") ;
    h.definedone () ;
  end ;
  do _i = 1 by 1 until (last.sg_a) ;
    set table_a (in = _a) table_b (rename = (sg_b = sg_a)) ;
    by sg_a ;
    if _a then h.add() ;              /* load Table_A items into the hash    */
    else do _i = 1 by 0 until (short = 0 or _i > h.num_items) ;
      h.find() ;                      /* extract EXTRA, KEY_A for item _i    */
      actual_transfer = extra min short ;
      if actual_transfer > 0 then do ;
        balance_extra = extra - actual_transfer ;
        output ;
      end ;
      extra +- actual_transfer ;      /* sum statement: subtract the transfer */
      short +- actual_transfer ;
      h.replace() ;                   /* write the updated EXTRA back         */
      if extra = 0 then _i + 1 ;      /* item exhausted: move to the next one */
    end ;
  end ;
  h.clear() ;                         /* empty the table for the next BY group */
run ;

Note that with the array solution, we needn't clear the array content before each next BY group: when the next group is loaded, its first _iM items get overwritten, and the further balance transfer processing occurs only up to _iM.

 

Kind regards

Paul D.

 

  

sagulolo
Quartz | Level 8

@KachiM 

 

Thanks for another tip - the DATA step debugger. This is the first time I've heard of it; I googled it, and it is a much better way to visualize the flow than removing the noprint option and cancelling those DROP statements to understand what is going on.

 

@hashman 

 

Deepest apologies for the late reply. By the time I tested the first solution (array), you had already shared the second solution (hash method) and further explained the logic behind solving this puzzle. The array version worked perfectly; I will test the second solution tomorrow and compare the time consumed by both, then share the timings and confirm the best solution. (Actually, I'm not sure I'm doing a fair comparison - I'm thinking of using the same real data and just comparing the time spent.)

 

Best Regards

sagulolo
Quartz | Level 8

Dear @hashman  @KachiM ,

 

My deepest apologies for the delay - I couldn't get to it last Friday before my long leave. Below are my test results:

 

1486 observations read from the data set WORK.TABLE_A.
1007 observations read from the data set WORK.TABLE_B.

Total of 2493 observations only.

 

Hash Method:

 

NOTE: There were 1486 observations read from the data set WORK.TABLE_A.
NOTE: There were 1007 observations read from the data set WORK.TABLE_B.
NOTE: The data set WORK.WANT has 1085 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

 

Array Method:

 

NOTE: There were 1486 observations read from the data set WORK.TABLE_A.
NOTE: There were 1007 observations read from the data set WORK.TABLE_B.
NOTE: The data set WORK.WANT has 1085 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

 

From the above times, I can say the array method is faster than the hash method. However, I wish I could choose both as the solution, because both produce the same number of observations (sorry, I didn't have time to check them all).

 

I hope this will benefit others as well. I also welcome other experts to provide alternative solutions and comments.

 

Thanks again to both of you for your time.

 

 

hashman
Ammonite | Level 13

@sagulolo:

As noted earlier, I'd expect the array to be faster than the hash - in this particular situation and for the reasons already stated.

However, truth be told, a real speed comparison test should involve more than a thousand or so records. When both methods being compared run under 0.03 seconds, it's difficult to attribute the difference between the run times shown in the log to the methods themselves rather than to a myriad of always existing confounding factors (such as other processes running on the same machine concurrently with SAS).
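The same measurement caveat applies in any language. A small Python sketch of the usual mitigation - repeat the timed run several times and keep the best result, which is the least contaminated by concurrent activity on the machine (the workload below is a stand-in, not the SAS steps):

```python
import time

def time_repeated(fn, repeats=20):
    """Time fn() several times and keep the minimum elapsed time,
    which is least affected by other processes running concurrently."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Toy workload standing in for either DATA step method.
t = time_repeated(lambda: sum(range(100_000)))
```

Even with repetition, a meaningful comparison needs inputs large enough that the methods' run times dwarf the measurement noise - hence the suggestion to test on far more than a couple of thousand records.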

 

Kind regards

Paul D.  
