Solved: Cumulative sum by byvar in the order of rank variable

Macro · Posted 10-21-2021 12:40 AM

Hi,

I would like to create a new dataset from an old dataset, so that the new dataset will have cumulative sum of variables from the old dataset. This should be done by byvar variables in the old dataset. An old data example is like this temp dataset:

data temp;
input Factor1 $  Factor2 $  rank x y;
datalines;
A  B  1  1  2
A  B  2  2  4
A  B  3  3  6
A  C  1  1  7
A  C  2  3  8
A  C  3  5  9
;
run

The new dataset should be like this temp2 dataset:

data temp2;
input Factor1 $  Factor2 $  rank x y csum_x  csum_y;
datalines;
A  B  1  1  2  1  2
A  B  2  2  4  3  6
A  B  3  3  6  6  12
A  C  1  1  7  1  7
A  C  2  3  8  4  15
A  C  3  5  9  9  24
;
run

The cumulative sum of x and y are done by Factor1 and Factor2 combination, and along with the rank variable rank. The general factor variable combination could be more than 2 factor variables. The old dataset temp could potentially has other variables not listed here. The final result should be a SAS dataset like temp2. Is there an IML way to do this? Thanks.

Rick_SAS · Posted 10-21-2021 08:25 AM

I was hoping to give the OP a chance to think about this problem before providing the solution. But since KSharp wrote the program according to my outline, here is my version:

proc iml;
use temp;
read all var {Factor1 Factor2} into G;  /* BY groups */
read all var {x y} into X[c=varNames];  /* the data for each BY group */
close;

/* if the input data is already sorted by Factors, you can find the 
   rows that separate the BY groups */
byRows = uniqueby(G, 1:ncol(G));
byRows = byRows // nrow(G);    /* append the last rows for easy processing */ 

/* loop over the BY groups and compute the cumulative sums */
S = j(nrow(G), ncol(X), .);   /* store the cumulative sums here */
do i = 1 to nrow(byRows)-1;
   rowStart = byRows[i]; rowEnd = byRows[i+1]; 
   do j = 1 to ncol(X);
      S[rowStart:rowEnd, j] = cusum(X[rowStart:rowEnd, j]);
   end;
end;

print G X[c=varNames] S[c=('c'+varNames)];

View solution in original post

PeterClemmensen · Posted 10-21-2021 02:41 AM

Why do you want to do this in IML? Seems the right tool is the data step?

The DATA to DATA Step Macro
Blog: SASnrd

yabwon · Posted 10-21-2021 03:27 AM

As @PeterClemmensen suggested, use data step, assuming TEMP is sorted by Factor1, Factor2, and rank

data want;
 set temp;
 by Factor1 Factor2;
  if first.Factor2 then call missing(cum_x, cum_y);
  cum_x + x;
  cum_y + y;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Rick_SAS · Posted 10-21-2021 06:04 AM

This appears to be the same question as

https://communities.sas.com/t5/SAS-Programming/Cumulative-Sum-in-the-order-of-rank-variable-by-byvar...

but the OP is requesting to do it in IML. Those who have non-IML solutions can post them to the other thread.

Sure, you can do this in IML if you want. I notice you've posted several similar questions to the IML community, so I assume you are trying to learn IML. I'll outline the general idea and give you a chance to experiment with the ideas and improve your IML programming skills.

1. Read the data into IML matrices. Read the factors into a BY-group matrix, G, and the variables into a data matrix, X.

2. Assuming that the input data are already sorted by the factors, use the UNIQUEBY function to find the rows of X that represent the first row for each BY group. As shown in the article, you can append nrow(G) to the vector to make subsequent processing easier.

3. Loop over the BY-group rows. Call the CUSUM function for each variable in X.

4. At the end of the loop over the BY-groups, you will have a matrix (call it C) that contains the cumulative sums of each variable for each BY-group. If you need that in a SAS data set, create a data set from C and merge it with the original data.

Try to program it yourself. If you get stuck, post what you have and others will help.

Ksharp · Posted 10-21-2021 07:49 AM

data temp;
input Factor1 $  Factor2 $  rank x y;
datalines;
A  B  1  1  2
A  B  2  2  4
A  B  3  3  6
A  C  1  1  7
A  C  2  3  8
A  C  3  5  9
A  D  1  1  7
A  D  2  3  8
A  D  3  5  9
;
run;
proc iml;
use temp nobs nobs;
read all var{Factor1  Factor2} into groups;
read all var{Factor1  Factor2 rank x y};
close;
start_idx=uniqueby(groups,(1:2));
end_idx=remove(start_idx,1)-1||nobs;
do i=1 to nrow(start_idx);
 csum_x=csum_x//cusum(x[start_idx[i]:end_idx[i]]);
 csum_y=csum_y//cusum(y[start_idx[i]:end_idx[i]]);
end;

create want var{Factor1  Factor2 rank x y csum_x csum_y};
append;
close;

quit;

Rick_SAS · Posted 10-21-2021 08:25 AM

I was hoping to give the OP a chance to think about this problem before providing the solution. But since KSharp wrote the program according to my outline, here is my version:

proc iml;
use temp;
read all var {Factor1 Factor2} into G;  /* BY groups */
read all var {x y} into X[c=varNames];  /* the data for each BY group */
close;

/* if the input data is already sorted by Factors, you can find the 
   rows that separate the BY groups */
byRows = uniqueby(G, 1:ncol(G));
byRows = byRows // nrow(G);    /* append the last rows for easy processing */ 

/* loop over the BY groups and compute the cumulative sums */
S = j(nrow(G), ncol(X), .);   /* store the cumulative sums here */
do i = 1 to nrow(byRows)-1;
   rowStart = byRows[i]; rowEnd = byRows[i+1]; 
   do j = 1 to ncol(X);
      S[rowStart:rowEnd, j] = cusum(X[rowStart:rowEnd, j]);
   end;
end;

print G X[c=varNames] S[c=('c'+varNames)];

Macro · Posted 10-23-2021 03:44 PM

Thanks Rick. This is a situation that proc means cannot handle directly and other means may be necessary. In particular, this is a cusum, an output of a column not a scalar summary statistic. Though the datastep using missing routine and cusum statement provides the simpler solution, the IML way provides an alternative solution and an opportunity of practice. While Ksharp uses the exact index, start and end (by the way, Ksharp also has a nice sql solution), you overwrite the last element of cusum result for each level of the factor combination (this initially puzzled me a little bit). In addition, pre-allocation of memory and assignment of components of a matrix, instead of repeated concatenation, make your code more efficient. Well done. I learned a lot from your blog, and of course, your book as well.

Ksharp · Posted 10-24-2021 05:53 AM

"you overwrite the last element of cusum result for each level of the factor combination (this initially puzzled me a little bit). "

Me too. But Rick get right result .

Macro · Posted 10-23-2021 03:28 PM

Nice try, Ksharp. Another solution given in addition to your SQL solution.

Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

Re: Cumulative sum by byvar in the order of rank variable

SAS Innovate 2025: Call for Content