Hi,
I would like to create a new dataset from an old dataset, so that the new dataset will have cumulative sum of variables from the old dataset. This should be done by byvar variables in the old dataset. An old data example is like this temp dataset:
data temp;
input Factor1 $ Factor2 $ rank x y;
datalines;
A B 1 1 2
A B 2 2 4
A B 3 3 6
A C 1 1 7
A C 2 3 8
A C 3 5 9
;
run
The new dataset should be like this temp2 dataset:
data temp2;
input Factor1 $ Factor2 $ rank x y csum_x csum_y;
datalines;
A B 1 1 2 1 2
A B 2 2 4 3 6
A B 3 3 6 6 12
A C 1 1 7 1 7
A C 2 3 8 4 15
A C 3 5 9 9 24
;
run
The cumulative sum of x and y are done by Factor1 and Factor2 combination, and along with the rank variable rank. The general factor variable combination could be more than 2 factor variables. The old dataset temp could potentially has other variables not listed here. The final result should be a SAS dataset like temp2. Is there an IML way to do this? Thanks.
I was hoping to give the OP a chance to think about this problem before providing the solution. But since KSharp wrote the program according to my outline, here is my version:
proc iml;
use temp;
read all var {Factor1 Factor2} into G; /* BY groups */
read all var {x y} into X[c=varNames]; /* the data for each BY group */
close;
/* if the input data is already sorted by Factors, you can find the
rows that separate the BY groups */
byRows = uniqueby(G, 1:ncol(G));
byRows = byRows // nrow(G); /* append the last rows for easy processing */
/* loop over the BY groups and compute the cumulative sums */
S = j(nrow(G), ncol(X), .); /* store the cumulative sums here */
do i = 1 to nrow(byRows)-1;
rowStart = byRows[i]; rowEnd = byRows[i+1];
do j = 1 to ncol(X);
S[rowStart:rowEnd, j] = cusum(X[rowStart:rowEnd, j]);
end;
end;
print G X[c=varNames] S[c=('c'+varNames)];
Why do you want to do this in IML? Seems the right tool is the data step?
As @PeterClemmensen suggested, use data step, assuming TEMP is sorted by Factor1, Factor2,
and rank
data want;
set temp;
by Factor1 Factor2;
if first.Factor2 then call missing(cum_x, cum_y);
cum_x + x;
cum_y + y;
run;
Bart
This appears to be the same question as
but the OP is requesting to do it in IML. Those who have non-IML solutions can post them to the other thread.
Sure, you can do this in IML if you want. I notice you've posted several similar questions to the IML community, so I assume you are trying to learn IML. I'll outline the general idea and give you a chance to experiment with the ideas and improve your IML programming skills.
1. Read the data into IML matrices. Read the factors into a BY-group matrix, G, and the variables into a data matrix, X.
2. Assuming that the input data are already sorted by the factors, use the UNIQUEBY function to find the rows of X that represent the first row for each BY group. As shown in the article, you can append nrow(G) to the vector to make subsequent processing easier.
3. Loop over the BY-group rows. Call the CUSUM function for each variable in X.
4. At the end of the loop over the BY-groups, you will have a matrix (call it C) that contains the cumulative sums of each variable for each BY-group. If you need that in a SAS data set, create a data set from C and merge it with the original data.
Try to program it yourself. If you get stuck, post what you have and others will help.
data temp;
input Factor1 $ Factor2 $ rank x y;
datalines;
A B 1 1 2
A B 2 2 4
A B 3 3 6
A C 1 1 7
A C 2 3 8
A C 3 5 9
A D 1 1 7
A D 2 3 8
A D 3 5 9
;
run;
proc iml;
use temp nobs nobs;
read all var{Factor1 Factor2} into groups;
read all var{Factor1 Factor2 rank x y};
close;
start_idx=uniqueby(groups,(1:2));
end_idx=remove(start_idx,1)-1||nobs;
do i=1 to nrow(start_idx);
csum_x=csum_x//cusum(x[start_idx[i]:end_idx[i]]);
csum_y=csum_y//cusum(y[start_idx[i]:end_idx[i]]);
end;
create want var{Factor1 Factor2 rank x y csum_x csum_y};
append;
close;
quit;
I was hoping to give the OP a chance to think about this problem before providing the solution. But since KSharp wrote the program according to my outline, here is my version:
proc iml;
use temp;
read all var {Factor1 Factor2} into G; /* BY groups */
read all var {x y} into X[c=varNames]; /* the data for each BY group */
close;
/* if the input data is already sorted by Factors, you can find the
rows that separate the BY groups */
byRows = uniqueby(G, 1:ncol(G));
byRows = byRows // nrow(G); /* append the last rows for easy processing */
/* loop over the BY groups and compute the cumulative sums */
S = j(nrow(G), ncol(X), .); /* store the cumulative sums here */
do i = 1 to nrow(byRows)-1;
rowStart = byRows[i]; rowEnd = byRows[i+1];
do j = 1 to ncol(X);
S[rowStart:rowEnd, j] = cusum(X[rowStart:rowEnd, j]);
end;
end;
print G X[c=varNames] S[c=('c'+varNames)];
Thanks Rick. This is a situation that proc means cannot handle directly and other means may be necessary. In particular, this is a cusum, an output of a column not a scalar summary statistic. Though the datastep using missing routine and cusum statement provides the simpler solution, the IML way provides an alternative solution and an opportunity of practice. While Ksharp uses the exact index, start and end (by the way, Ksharp also has a nice sql solution), you overwrite the last element of cusum result for each level of the factor combination (this initially puzzled me a little bit). In addition, pre-allocation of memory and assignment of components of a matrix, instead of repeated concatenation, make your code more efficient. Well done. I learned a lot from your blog, and of course, your book as well.
Nice try, Ksharp. Another solution given in addition to your SQL solution.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.