Solved: Data step summation of column observations by multiple variables WITHO...

Unagi · Posted 07-26-2022 07:27 PM

Hello. I am a novice SAS user looking to learn how to sum numeric column observations in a data step by multiple variables (ID and CY) WITHOUT rolling any records up.

I would like to turn this:

into this:

I can do so using a proc step, and I can roll up records to only show a data table with unique ID observations, but I would like to learn how to utilize the DO loop or a RETAIN or FIRST/LAST in order to generate a summation by CY and ID that does not roll any records up, and simply lists the entire CY sum for individual IDs.

Looking for a point in the right direction, not just to be given the solution if that helps!

Thank you.

mkeintz · Posted 07-26-2022 09:33 PM

Although your sample data is not sorted by ID/CY, it does appear to be grouped by ID/CY. If so, then the DATA step is an efficient way to generate the results you want.

Using a SET statement with a BY statement tells the DATA step to generate two dummy variables for each BY variable. So if you have

   set have;
   by id cy;

you would get a FIRST.CY=1 whenever you read an observation with a change in CY value. In the example above whenever ID changes values, then FIRST.ID=1 (and all by-variables to its right will also have FIRST. dummy=1). So FIRST.CY=1 whenever CY changes, as well as whenever ID changes. Otherwise first.cy=0.

This allows you a way to determine whenever the observation in hand is the start of a new ID/CY group, i.e. this is when you want to initialize the value of the sum_payment_cy variable. If the sum_payment_cy variable is RETAINED, then adding PAYMENT to it will allow it to accumulate total payments for the given ID/CY group.

At the END of the id/cy group (i.e. when last.cy=1), you now have the value of interest in sum_payment_cy.

But for this sample data the above code will fail because SAS expects the data to be sorted by ID CY. You might choose to pre-sort the data first (see my note at the end). But your desired result is not sorted, (just grouped). To preserve that order just tell SAS that the data is not sorted.

set have;
by id cy notsorted;

Even with the notsorted option, all the first. and last. dummies will work as described.

But once you've calculated the correct sum_payment_cy (when last.cy=1), you've already read in all the members of the id/cy group. So you have to re-read the group and output each record with the freshly calculated sum_payment_cy, as here:

data want (drop=_:);
  set have ;
  by id cy notsorted;

  *Calculate sum_payment_cy *;
  retain sum_payment_cy ;
  if first.cy then sum_payment_cy=0;
  sum_payment_cy=sum_payment_cy+payment;

  *Reread and output the id/cy group*;
  if last.cy then do _i=1 to coalesce(dif(_n_),_n_);
    set have ;
    output;
  end;
run;

The loop with the "if last.cy" condition rereads x observations, where x is the number of observations in the ID/CY group. X is calculated by the DIF function, which finds the difference in the value of _N_ every time last.cy=1. If that result is missing, as it must be the first time DIF is executed, then X is _N_, the observation number of the last member of the first group.

By the way, if the data were already sorted by ID/CY, the code would be a great deal easier.

data want;
  set have (in=firstpass)
      have (in=secondpass);
  by id cy;
  retain sum_payment_cy;
  if first.cy then sum_payment_cy=0;
  if firstpass=1 then sum_payment_cy=sum_payment_cy+payment;
  if secondpass;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

SASKiwi · Posted 07-26-2022 08:30 PM

SQL is a great choice for this and it is easier than a DATA step solution:

proc sql;
  create table want as
  select A.*
        ,B.Sum_Payment_CY
  from have as A
  left join
  (select ID
          ,CY
          ,sum(Payment) as Sum_Payment_CY
   from have
   group by ID
            ,CY
  ) as B
  on A.ID = B.ID
  and A.CY = B.CY
  ;
quit;

Unagi · Posted 07-27-2022 11:48 AM

This is wonderful information, thank you. I am looking to understand how I might do so in a data step, in order to educate myself as a novice programmer. This is very useful though and will make sure I understand the SQL code. Really appreciate the reply.

SASKiwi · Posted 07-27-2022 05:10 PM

@Unagi - Thanks for the feedback. I think it is worth pointing out that while a DATA step can work equally as well with one level of summarisation, in my experience SQL is way better when you want to do multiple summarisations at different levels. Also your input data doesn't have to be in any particular order for it to work.

BTW, I've corrected the posted query as it was missing a semicolon.

mkeintz · Posted 07-26-2022 09:33 PM

Although your sample data is not sorted by ID/CY, it does appear to be grouped by ID/CY. If so, then the DATA step is an efficient way to generate the results you want.

Using a SET statement with a BY statement tells the DATA step to generate two dummy variables for each BY variable. So if you have

   set have;
   by id cy;

you would get a FIRST.CY=1 whenever you read an observation with a change in CY value. In the example above whenever ID changes values, then FIRST.ID=1 (and all by-variables to its right will also have FIRST. dummy=1). So FIRST.CY=1 whenever CY changes, as well as whenever ID changes. Otherwise first.cy=0.

This allows you a way to determine whenever the observation in hand is the start of a new ID/CY group, i.e. this is when you want to initialize the value of the sum_payment_cy variable. If the sum_payment_cy variable is RETAINED, then adding PAYMENT to it will allow it to accumulate total payments for the given ID/CY group.

At the END of the id/cy group (i.e. when last.cy=1), you now have the value of interest in sum_payment_cy.

But for this sample data the above code will fail because SAS expects the data to be sorted by ID CY. You might choose to pre-sort the data first (see my note at the end). But your desired result is not sorted, (just grouped). To preserve that order just tell SAS that the data is not sorted.

set have;
by id cy notsorted;

Even with the notsorted option, all the first. and last. dummies will work as described.

But once you've calculated the correct sum_payment_cy (when last.cy=1), you've already read in all the members of the id/cy group. So you have to re-read the group and output each record with the freshly calculated sum_payment_cy, as here:

data want (drop=_:);
  set have ;
  by id cy notsorted;

  *Calculate sum_payment_cy *;
  retain sum_payment_cy ;
  if first.cy then sum_payment_cy=0;
  sum_payment_cy=sum_payment_cy+payment;

  *Reread and output the id/cy group*;
  if last.cy then do _i=1 to coalesce(dif(_n_),_n_);
    set have ;
    output;
  end;
run;

The loop with the "if last.cy" condition rereads x observations, where x is the number of observations in the ID/CY group. X is calculated by the DIF function, which finds the difference in the value of _N_ every time last.cy=1. If that result is missing, as it must be the first time DIF is executed, then X is _N_, the observation number of the last member of the first group.

By the way, if the data were already sorted by ID/CY, the code would be a great deal easier.

data want;
  set have (in=firstpass)
      have (in=secondpass);
  by id cy;
  retain sum_payment_cy;
  if first.cy then sum_payment_cy=0;
  if firstpass=1 then sum_payment_cy=sum_payment_cy+payment;
  if secondpass;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Re: Data step summation of column observations by multiple variables WITHOUT rolling up rows

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!