08-04-2011 05:10 PM
I hope someone can help me solve the following problem: in a macro I loop over a fairly large dataset (say, 100,000 observations), where each iteration involves some calculations in a data step. Say row i represents a given transaction made by a given customer with a fixed budget, and that I need to calculate the remaining budget after the transaction at hand. A given customer can make several transactions, in which case he will be represented by other rows as well. For some reason I cannot sort the dataset by customer ID, so the rows pertaining to any given customer are scattered all over the dataset. Each time a given customer shows up in the loop, I need to calculate his remaining budget and transfer this value to all subsequent rows pertaining to this customer - before the next iteration.
(The real problem at hand is slightly more complicated, since it involves deducting some amount from total_A and some amount from total_B, where total_A pertains to ID_A and total_B pertains to ID_B. The results from the calculations in row i then need to be updated in all subsequent rows where these IDs show up. The IDs do not necessarily show up at the same time; that is, the remaining ID_A's could be in row i+10 while the remaining ID_B's could be in rows i+10000 and i+50000. This needs to be done BEFORE the next row i+1 is processed in the loop. For my purpose, the dataset HAS to be sorted by variables other than any of the IDs.)
To solve this problem I have done the following: pick row i in the input dataset x and do the calculations in a data step, which produces a dataset y with one row (row i). Drop row i from the input dataset x and use proc sql to join the new values from dataset y onto the relevant remaining IDs in the reduced dataset x. Finally, append the row in dataset y to a final output dataset before re-entering the loop, where the next row (row i+1) of the reduced dataset x is processed.
Having a fairly large dataset (100,000+ rows), the problem with this procedure is runtime, since dropping out of the data step to do the SQL join is very time consuming. Although the dataset shrinks by one row per iteration (which should eventually speed up the SQL join), it nevertheless seems an inefficient way of coding this macro. From the outset, 1,000 rows execute in approximately half an hour of real time, which (if the join on the continually reduced dataset doesn't speed up) will total about 50 hours of real time for 100,000 rows. This is of course unacceptable.
Therefore: how can I get better performance? Is it possible to update values in subsequent rows after processing row i but before processing row i+1, within the data step? (I suspect this would be faster than leaving the data step on each iteration to do an SQL join.)
Sorry for not providing any code, but the macro is rather long and cumbersome and difficult to simplify for the purpose of an example.
Best regards Ulrik
08-04-2011 06:04 PM
Why can't you sort the file?
Do the records in the final output file have to be in the same order as the input file?
Can you provide some sample data, preferably in the form of a datastep?
My initial guess is that you would be best off using either a hash object or a multidimensional array. 100,000 records is not very big by SAS standards. I also wonder if you really even need to do this with a macro.
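For the budget example, a hash keyed on customer ID would let you carry each customer's running budget forward in a single pass, with no macro loop at all. A minimal sketch (the dataset and variable names - transactions, cust_id, budget, amount - are made up, since we haven't seen your data):

```sas
data want (drop = rc);
  if _n_ = 1 then do;
    declare hash h();            /* running budget per customer */
    h.defineKey('cust_id');
    h.defineData('remaining');
    h.defineDone();
  end;
  set transactions;
  if h.find() ne 0 then remaining = budget;  /* first time we see this customer */
  remaining = remaining - amount;            /* deduct the current transaction */
  rc = h.replace();                          /* visible to all later rows for this customer */
run;
```

The point is that a later row for the same customer picks the updated balance straight out of the hash, so nothing ever has to be written back into "subsequent rows" - the order of the input file doesn't matter.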
08-04-2011 06:38 PM
As Art has suggested, I would sort the data in customer ID order, then in a DATA step use FIRST. and LAST. processing with a BY custid statement. This allows you to calculate all of the statistics required for one customer in one go. Then I would output two datasets: one with all of the rows, and another containing one row per customer, triggered by an IF LAST.custid THEN OUTPUT statement.
Then you can take the customer-level dataset and join it back on the detail level dataset by custid to get your customer-level statistics on each row.
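In code terms, roughly like this (assuming a detail dataset called have with variables custid and amount - adjust the names to your data):

```sas
proc sort data = have out = sorted;
  by custid;
run;

data detail summary (keep = custid total_spent);
  set sorted;
  by custid;
  if first.custid then total_spent = 0;  /* reset at the start of each customer */
  total_spent + amount;                  /* sum statement retains across rows */
  output detail;
  if last.custid then output summary;    /* one row per customer */
run;

proc sql;
  create table want as
  select d.*, s.total_spent
  from detail d left join summary s
    on d.custid = s.custid;
quit;
```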
08-05-2011 03:53 AM
Thx for showing interest art297 and SASkiwi
I think my "dummy example" is too inadequate, so let me instead introduce you to the real problem. In my data I have a lot of mortgages and a lot of loans. The mortgages are used as collateral to cover the loans, and the problem is really to optimize the usage of the collateral. A given mortgage can be used as collateral for several loans, and a loan can be collateralized by several mortgages. The data is sorted by some optimization criterion, which implies that loans and mortgages (in the absence of a 1:1 relationship - i.e. when one loan is covered by only one collateral, and this collateral only covers this particular loan) will be scattered across the dataset.
Having calculated for row i the amount of the loan to be covered, I need to calculate the unsecured part of the loan (if any), since this might be covered by another collateral in a later row. I also have to calculate the unused part of the collateral (if any), since this might be used to cover another loan in a later row. So to conclude: the input dataset is sorted by some optimization criterion that effectively scatters loan_IDs and collateral_IDs. Due to the possible interdependencies between loans and collaterals, I need to update the balances of both loans and mortgages in all subsequent rows > i before I loop to the next row i+1 and do the calculations again.
Is there any way I can provide the code for you? (Pasting code from the SAS editor, Notepad or others into this reply window seems impossible, and typing the code in directly is rather time consuming.)
Best regards Ulrik
08-05-2011 06:48 AM
Copying and pasting from the code editor should work, albeit it won't be formatted. Alternatively, you can save the code in a file (e.g. your existing .SAS file) and attach that to your post - you'll need to switch to the Advanced Editor mode, and then there's a browse button below the text editor.
08-06-2011 04:48 AM
I found a solution that avoids the SQL join and instead uses a RETAIN statement within the data step. The example below is a simplified but illustrative version of the macro.
In the input dataset below there are two variables to be calculated: ref1 and ref2. Their values are given by val1 and val2, respectively. Initially, their new values val1_new and val2_new are set as val1 and val2, respectively. In the macro, I wish in each row to deduct 10 from val1 and val2, and update the new values for ref1 and ref2 in subsequent rows before continuing with the loop.
Note that the input dataset is not (and for my purpose cannot be) sorted by either ref1 or ref2! Note also that ref1 = 1 appears in rows 1 and 4, while ref2 = 1 appears in rows 1 and 5, i.e. unsorted.
Running the macro, you will see in the results dataset that val1 in row 4 matches val1_new from row 1, and that val2 in row 5 matches val2_new from row 1. This shows that the calculations in row 1 (val1_new and val2_new) have been inserted into val1 and val2, respectively, in later rows where ref1 = 1 and ref2 = 1.
data input;
  input id ref1 ref2 val1 val2 val1_new val2_new;
  datalines;
1 1 1 100 100 100 100
2 2 2 200 150 200 150
3 3 3 50 50 50 50
4 1 4 100 25 100 25
5 4 1 400 100 400 100
;
run;

proc datasets nolist; delete results; run;

proc sql noprint;
  select count(*) into :n from input;
quit;

%macro loop;
  %let i = 1;
  %do %while (&i. le &n.);

    data temp;
      set input;
      retain last_ref1 last_ref2 last_val1_new last_val2_new;
      if id = 1 then do;            /* process the current row */
        val1_new = val1 - 10;
        val2_new = val2 - 10;
        last_ref1 = ref1;
        last_ref2 = ref2;
        last_val1_new = val1_new;
        last_val2_new = val2_new;
      end;
      if id ge 2 then do;           /* push the new values into later rows */
        if ref1 = last_ref1 then val1 = last_val1_new;
        if ref2 = last_ref2 then val2 = last_val2_new;
      end;
    run;

    data output (drop = last_ref1 last_ref2 last_val1_new last_val2_new);
      set temp (obs = 1);
    run;

    data input (drop = last_ref1 last_ref2 last_val1_new last_val2_new);
      set temp (where = (id ge 2));
      id = _n_;                     /* renumber so the next row to process becomes id = 1 */
    run;

    proc append base = results data = output force; run;

    %let i = %eval(&i. + 1);
  %end;
%mend loop;
%loop
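For completeness: the same toy example can also be done in a single data step pass with two hash objects (one keyed on ref1, one on ref2), along the lines art297 suggested, avoiding the macro loop entirely. This is only a sketch against the five-row example above, not tested on the real data:

```sas
data results2 (drop = rc);
  if _n_ = 1 then do;
    declare hash h1();  h1.defineKey('ref1');  h1.defineData('val1_new');  h1.defineDone();
    declare hash h2();  h2.defineKey('ref2');  h2.defineData('val2_new');  h2.defineDone();
  end;
  set input;
  if h1.find() = 0 then val1 = val1_new;  /* pick up the result of the last row with this ref1 */
  if h2.find() = 0 then val2 = val2_new;  /* ditto for ref2 */
  val1_new = val1 - 10;
  val2_new = val2 - 10;
  rc = h1.replace();                      /* make the new values visible to later rows */
  rc = h2.replace();
run;
```

In this version row 4 picks up val1 = 90 from row 1's val1_new, and row 5 picks up val2 = 90, matching what the macro produces - but in one pass instead of n passes.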