Hello. I am a novice SAS user looking to learn how to sum numeric column observations in a data step by multiple variables (ID and CY) WITHOUT rolling any records up.
I would like to turn this:
into this:
I can do so using a proc step, and I can roll up records to only show a data table with unique ID observations, but I would like to learn how to utilize the DO loop or a RETAIN or FIRST/LAST in order to generate a summation by CY and ID that does not roll any records up, and simply lists the entire CY sum for individual IDs.
Looking for a point in the right direction, not just to be given the solution if that helps!
Thank you.
Although your sample data is not sorted by ID/CY, it does appear to be grouped by ID/CY. If so, then the DATA step is an efficient way to generate the results you want.
Using a SET statement with a BY statement tells the DATA step to generate two dummy variables for each BY variable. So if you have
set have;
by id cy;
you would get a FIRST.CY=1 whenever you read an observation with a change in CY value. In the example above whenever ID changes values, then FIRST.ID=1 (and all by-variables to its right will also have FIRST. dummy=1). So FIRST.CY=1 whenever CY changes, as well as whenever ID changes. Otherwise first.cy=0.
This allows you a way to determine whenever the observation in hand is the start of a new ID/CY group, i.e. this is when you want to initialize the value of the sum_payment_cy variable. If the sum_payment_cy variable is RETAINED, then adding PAYMENT to it will allow it to accumulate total payments for the given ID/CY group.
At the END of the id/cy group (i.e. when last.cy=1), you now have the value of interest in sum_payment_cy.
But for this sample data the above code will fail because SAS expects the data to be sorted by ID CY. You might choose to pre-sort the data first (see my note at the end). But your desired result is not sorted, (just grouped). To preserve that order just tell SAS that the data is not sorted.
set have;
by id cy notsorted;
Even with the notsorted option, all the first. and last. dummies will work as described.
But once you've calculated the correct sum_payment_cy (when last.cy=1), you've already read in all the members of the id/cy group. So you have to re-read the group and output each record with the freshly calculated sum_payment_cy, as here:
data want (drop=_:);
set have ;
by id cy notsorted;
*Calculate sum_payment_cy *;
retain sum_payment_cy ;
if first.cy then sum_payment_cy=0;
sum_payment_cy=sum_payment_cy+payment;
*Reread and output the id/cy group*;
if last.cy then do _i=1 to coalesce(dif(_n_),_n_);
set have ;
output;
end;
run;
The loop with the "if last.cy" condition rereads x observations, where x is the number of observations in the ID/CY group. X is calculated by the DIF function, which finds the difference in the value of _N_ every time last.cy=1. If that result is missing, as it must be the first time DIF is executed, then X is _N_, the observation number of the last member of the first group.
By the way, if the data were already sorted by ID/CY, the code would be a great deal easier.
data want;
set have (in=firstpass)
have (in=secondpass);
by id cy;
retain sum_payment_cy;
if first.cy then sum_payment_cy=0;
if firstpass=1 then sum_payment_cy=sum_payment_cy+payment;
if secondpass;
run;
SQL is a great choice for this and it is easier than a DATA step solution:
proc sql;
create table want as
select A.*
,B.Sum_Payment_CY
from have as A
left join
(select ID
,CY
,sum(Payment) as Sum_Payment_CY
from have
group by ID
,CY
) as B
on A.ID = B.ID
and A.CY = B.CY
;
quit;
@Unagi - Thanks for the feedback. I think it is worth pointing out that while a DATA step can work equally as well with one level of summarisation, in my experience SQL is way better when you want to do multiple summarisations at different levels. Also your input data doesn't have to be in any particular order for it to work.
BTW, I've corrected the posted query as it was missing a semicolon.
Although your sample data is not sorted by ID/CY, it does appear to be grouped by ID/CY. If so, then the DATA step is an efficient way to generate the results you want.
Using a SET statement with a BY statement tells the DATA step to generate two dummy variables for each BY variable. So if you have
set have;
by id cy;
you would get a FIRST.CY=1 whenever you read an observation with a change in CY value. In the example above whenever ID changes values, then FIRST.ID=1 (and all by-variables to its right will also have FIRST. dummy=1). So FIRST.CY=1 whenever CY changes, as well as whenever ID changes. Otherwise first.cy=0.
This allows you a way to determine whenever the observation in hand is the start of a new ID/CY group, i.e. this is when you want to initialize the value of the sum_payment_cy variable. If the sum_payment_cy variable is RETAINED, then adding PAYMENT to it will allow it to accumulate total payments for the given ID/CY group.
At the END of the id/cy group (i.e. when last.cy=1), you now have the value of interest in sum_payment_cy.
But for this sample data the above code will fail because SAS expects the data to be sorted by ID CY. You might choose to pre-sort the data first (see my note at the end). But your desired result is not sorted, (just grouped). To preserve that order just tell SAS that the data is not sorted.
set have;
by id cy notsorted;
Even with the notsorted option, all the first. and last. dummies will work as described.
But once you've calculated the correct sum_payment_cy (when last.cy=1), you've already read in all the members of the id/cy group. So you have to re-read the group and output each record with the freshly calculated sum_payment_cy, as here:
data want (drop=_:);
set have ;
by id cy notsorted;
*Calculate sum_payment_cy *;
retain sum_payment_cy ;
if first.cy then sum_payment_cy=0;
sum_payment_cy=sum_payment_cy+payment;
*Reread and output the id/cy group*;
if last.cy then do _i=1 to coalesce(dif(_n_),_n_);
set have ;
output;
end;
run;
The loop with the "if last.cy" condition rereads x observations, where x is the number of observations in the ID/CY group. X is calculated by the DIF function, which finds the difference in the value of _N_ every time last.cy=1. If that result is missing, as it must be the first time DIF is executed, then X is _N_, the observation number of the last member of the first group.
By the way, if the data were already sorted by ID/CY, the code would be a great deal easier.
data want;
set have (in=firstpass)
have (in=secondpass);
by id cy;
retain sum_payment_cy;
if first.cy then sum_payment_cy=0;
if firstpass=1 then sum_payment_cy=sum_payment_cy+payment;
if secondpass;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.