Hi Tom. I took the data and re-created the table in excel to get the values and formulas to derive the calculated fields. I needed to create helper fields such as NumberDays1 and BegAmount1 to get the final results. In the code below, the grouping is necessary so that the calculated NumberDays1 stops decreasing when their is a change in NumberDays as this distiguishes the different froup within ID. Thanks.
data have; format ID ShowAmount NumberDays NumberDaysMonth NumberDaysBeg NumberDaysEnd BegAmount best12.; infile datalines4 dlm='7F'x missover dsd; input ID ShowAmount NumberDays NumberDaysMonth NumberDaysBeg NumberDaysEnd BegAmount : best32.; datalines4; 152 . 36 28 0 28 3650 152 . 36 8 0 8 36500 152 . 85 14 0 14 4100 152 . 85 31 0 31 4100 152 700 85 30 10 20 4100 152 . 85 10 0 10 41000 152 . 71 19 0 19 4650 152 450 71 29 28 1 4650 152 . 71 23 0 23 4650 152 . 85 25 0 25 4650 152 . 85 36 0 31 4650 152 . 85 29 0 29 4650 ;;;; data have; set have; if ShowAmount = . then ShowAmount = 0; run; data want; set have;
by ID NumberDays notsorted;
if first.ID then do;
BegAmount1 = BegAmount;
EndAmount = (BegAmount + ShowAmount);
NumberDays1 = NumberDays;
MonthlyAmount = (((NumberDaysBeg/NumberDays1)*BegAmount1) + ((NumberDaysEnd/NumberDays1)*EndAmount));
EndAmount1 = (EndAmount - MonthlyAmount);
end;
if not first.ID then do;
BegAmount1 = BegAmount - lag(BegAmount) + lag(EndAmount1);
EndAmount = (BegAmount - lag(BegAmount) + ShowAmount + lag(EndAmount1));
NumberDays1 = lag(NumberDays1) - lag(NumberDaysMonth);
MonthlyAmount = (((NumberDaysBeg/NumberDays1)*BegAmount1) + ((NumberDaysEnd/NumberDays1)*EndAmount));
EndAount1 = (EndAmount - MonthlyAmount);
end;
run;
Want: ID ShowAmount NumberDays NumberDays1 NumberDaysMonth NumberDaysBeg NumberDaysEnd BegAmount BegAmount1 EndAmount MonthlyAmount EndAmount1 152 0 36 36 28 0 28 3650 3650 3650 2839 811 152 0 36 8 8 0 8 3650 811 811 811 0 152 0 85 85 14 0 14 4100 450 450 74 376 152 0 85 71 31 0 31 4100 376 376 164 212 152 700 85 40 30 10 20 4100 212 912 509 403 152 0 85 10 10 0 10 4100 403 403 403 0 152 0 71 71 19 0 19 4650 550 550 147 403 152 450 71 52 29 28 1 4650 403 853 233 620 152 0 71 23 23 0 23 4650 620 620 620 0 152 0 85 85 25 0 25 4650 0 0 0 0 152 0 85 60 31 0 31 4650 0 0 0 0 152 0 85 29 29 0 29 4650 0 0 0 0
If every other value is missing the that means the values the variable had when you called it also was alternating between missing and non-missing.
Here is simple example of how LAG() works.
71 data test; 72 set sashelp.class (obs=4); 73 lag_name=lag(name); 74 put (_n_ name lag_name) (=); 75 run; _N_=1 Name=Alfred lag_name= _N_=2 Name=Alice lag_name=Alfred _N_=3 Name=Barbara lag_name=Alice _N_=4 Name=Carol lag_name=Barbara
On first iteration (_N_=1) the value of ALFRED is saved and missing is returned.
Then on second (_N_=2) the value of ALICE is saves and ALFRED is returned.
etc.
I'm not understanding what you're trying to do, but I suspect part of the problem is that you are reading a variable TOTAL from have, and also trying to create a variable TOTAL in the datastep and use it as an accumulator (retaining across observations and incrementing it). That won't work, because every time you read a record from HAVE, it will over-write your retained value. You're probably better off creating a new variable TOTAL2 to use as an accumulator, to avoid the collision.
To edit look for the vertical ... under your user icon next to the top of the message. Clicking on that icon opens the menu that has Edit.
I don't follow the logic. Perhaps I am just not able to figure out what is going on because I am distracted by not understanding the data structure.
Are you trying to use BY ID NUMBERDAYS to mean that the first two rows are both in the same group somehow? A group that lasts 38 days. 28 for the first observation an 8 for the second. How does that impact the logic of the calculations? Or does it? You have a BY statement but you aren't referencing and of the FIRST. or LAST. variables that SAS will generate because of the BY statement.
Also why do you have BY ID when there is only one value of ID in the example data? Does your real data have multiple values of ID?
Hi Tom....Yes you are right. The first two rows are in the same group that last 36 days (28 + 8). To explain the logic, for the first row in this group, the BegAmount is 3650. Since there is no new revenue for ShowAmount, that would mean that the EndAmount is 3650 as well. The calculation of MonthlyAmount based on the BegAmount and EndAmount is 2839 and is calculated as a weighted average of the BegAmount and EndAmount values. This would mean that the new EndAmount after the calculation of the MonthlyAmount is actually 811 (3650 - 2839) which is EndAmount1.
For the second row, BegAmount becomes EndAmount1 which is 811 as this is a carryover from the previous row. Since their is no new revenue from Show Amount to add to obtain the EndAmount, the EndAmount = 811. Once again, the MonthlyAmount is calculated as 811. This should return a result of EndAmount1 = 0 (811 - 811), that is, BegAmount minus Monthly Amount.
The EndAmount is needed to calculate the MonthlyAmount before EndAmount1 is calculated and carried over to the next row.
Yes the real data has multiple ID's is why I have a by statement. Hopefully this explains the logic.
What is SHOWAMOUNT? Is that a deduction from the account? The balance? Why does it appear on one of the middle months in a group instead of at the start or end of a group of months?
Hi Tom.....ShowAmount is actually revenue received during that particular Month. For example, if you look at the row where ShowAmount = 700. For this particular row, the NumberDaysBeg =10 and NumberDaysEnd =20, that would mean that ShowAmount of 700 was received 10 days into the month which would suggest that it was available for last 20 days of the 30 day month. Because the BegAmount for that row is 0, then calculation for the MonthlyAmount is 467 (((10/30)*(0))+((20/30)*(700))).
The BegAmount will always be the EndAmount1 from the previous row as the Beginning Balance in the current row will always be the Ending Balance from the previous row.
The ShowAmount can be both positive or negative due to revenue or cost. By adding the ShowAmount to the BegAmount to get the EndAmount and then calculate the MonthlyAmount, that should reflect the actual amount of revenue to allocate for that particular month.
The reason why ShowAmounts appear in the middle of the group is because it was actually received in that month. For example, for the first row, the BegAmount is 3650. That means that 3650 was received before it was actually earned. For that particular row or month, services was delivered for 28 days out of a total of 36 days to complete to services to be delivered. It is possible to receive or even refund funds anytime during the time period to deliver the services. The allocation of funds to calculate MonthlyAmount is based on when funds have been received and earned.
I suggest you focus on one variable at a time. Your code is still a bit confusing. You have a BY statement, but it is doing nothing, because you're not using first. and last. variables. It sounds like you're trying to accumulate something (making a running total of some sort), but nothing is being retained.
If you focus on your first variable, EndAmount, your code is essentially:
data want;
set have;
EndAmount = (BegAmount+ShowAmount);
run;
So EndAmount is always being calculated by adding together two values you read from HAVE. Which isn't what you want, is it?
EndAmount is calculated by adding together two values (BegAmount & ShowAmount)that is read in from HAVE for the first row then EndAmount = previous row's EndAmount plus the current row's ShowAmount. The reason for calculating EndAmount1 was to get the new End Balance to carry it forward to the next row as the BegAmount so that EndAmount would continue to be calculated as BegAmount & ShowAmount for that same row as well. I am wondering if doing the calculations for the first row then a loop for the remaining rows as recalculations of the variables are needed.
So you have this data:
data have;
input id days daysmonth beg add ;
add=sum(add,0);
cards;
152 36 28 3650 .
152 36 8 3650 .
152 85 14 4100 .
152 85 31 4100 .
152 85 30 4100 700
152 85 10 4100 .
152 71 19 4650 .
152 71 29 4650 450
152 71 23 4650 .
152 85 25 4650 .
152 85 36 4650 .
152 85 29 4650 .
;
Sounds like for the first group of two months you have decided that the balance should go down from 3,650 at the start of the first month to 0 at the end of the second month and you want to divide that into two pieces that is proportional to the number of days that month has out of the whole period.
But then on the start of the second group the value has jumped back up to 4,100. Where did that come from? Why is that not reflected in your expected output?
Hi Tom,
The 4100 and the rest of the values in the BegAmount are a running total for the balance at the start of each project to account for revenue received between projects. I revised the original post and re-create the expected table in excel and incorporated the same formulas into the data step....Thanks
Do not put LAG() function calls inside of conditional execution. That messes up the list of values that it "lags". LAG() does not know anything about observations. It just returns out the previous value you passed in. If you must use LAG() and only want to used the lagged value sometimes then make sure to execute the LAG() function everytime and then reference the generated value conditionally.
You can also use RETAIN to keep values into the next observation rather than trying to use lag to look backwards. (Only use RETAIN for variables that are being created in this data step. Don't try to retain variables that are coming in from the input dataset because reading the next observation will overwrite the value.)
It can also help to OUTPUT before you are done calculating the things you want to RETAIN.
Hi Tom.....thanks for suggestion about the lag.....that was very helpful. I broke the data step into two data steps. The first to get the first row and starting entries which works fine. The second data step, I defined the lag variables outside the conditional execution and I am getting results for every second row. Is using lag's causing this? Thanks.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.