Hello,
I need a cumulative sum over the next several weeks, where the window length depends on a factor variable. I can do this with SQL, but performance is poor and it does not run on large data sets.
Basically, for each unique product and location, I want the sum of Value over the next X weeks, where X comes from the column WKFactor. The original data does not have UniqueKey; I created it for the SQL code.
| UniqueKey | Product | Location | Week | WKFactor | Value | DesiredSum | Note |
|-----------|---------|----------|----------|----------|-------|------------|------|
| A1 | A | 1 | 7-Jan-19 | 4 | 43 | 94 | sum of Value over the 4-week window starting at the current week (per WKFactor) for Product A, Location 1: 43+2+27+22 = 94 |
| A1 | A | 1 | 14-Jan-19 | 5 | 2 | 76 | sum of Value over the next 5 weeks (per WKFactor) |
| A1 | A | 1 | 21-Jan-19 | 4 | 27 | 74 | sum of Value over the next 4 weeks (per WKFactor) |
| A1 | A | 1 | 28-Jan-19 | 4 | 22 | 47 | WKFactor is 4, but only 3 weeks remain in the data set, so only those 3 weeks are summed |
| A1 | A | 1 | 4-Feb-19 | 4 | 22 | 25 | |
| A1 | A | 1 | 11-Feb-19 | 4 | 3 | 3 | |
| A2 | A | 2 | 7-Jan-19 | 2 | 31 | 66 | |
| A2 | A | 2 | 14-Jan-19 | 2 | 35 | 76 | |
| A2 | A | 2 | 21-Jan-19 | 2 | 41 | 47 | |
| A2 | A | 2 | 28-Jan-19 | 2 | 6 | 6 | |
| B1 | B | 1 | 7-Jan-19 | 3 | 22 | 101 | |
| B1 | B | 1 | 14-Jan-19 | 2 | 34 | 79 | |
| B1 | B | 1 | 21-Jan-19 | 3 | 45 | 97 | |
| B1 | B | 1 | 28-Jan-19 | 5 | 22 | 52 | |
| B1 | B | 1 | 4-Feb-19 | 2 | 30 | 30 | |
This is the SQL code that gives me the desired output, but the correlated subquery (effectively a self-join) takes too long on such a large data set. Any better ways?
proc sql ;
  create table work.desire as
  select uniquekey, product, location, week, wkfactor, value,
         (select sum(value) from work.have
          where uniquekey = a.uniquekey
            and week >= a.week
            and week <= intnx('week', a.week, a.wkfactor)) as desiredsum
  from work.have as a ;
quit ;
@lmg:
Since you have the data sorted by the (PRODUCT,LOCATION) composite key, there's no need to create UNIQUEKEY: the processing can rely on the composite key alone, and it looks like a straightforward DoW-loop/array job:
data have ;
  input (UniqueKey Product) (:$2.) Location Week :date. WKFactor Value ;
  /* the 7th field on each card is the expected sum; list input ignores it */
  format week yymmdd10. ;
  cards ;
A1 A 1 07-Jan-19 4 43 94
A1 A 1 14-Jan-19 5 2 76
A1 A 1 21-Jan-19 4 27 74
A1 A 1 28-Jan-19 4 22 47
A1 A 1 04-Feb-19 4 22 25
A1 A 1 11-Feb-19 4 3 3
A2 A 2 07-Jan-19 2 31 66
A2 A 2 14-Jan-19 2 35 76
A2 A 2 21-Jan-19 2 41 47
A2 A 2 28-Jan-19 2 6 6
B1 B 1 07-Jan-19 3 22 101
B1 B 1 14-Jan-19 2 34 79
B1 B 1 21-Jan-19 3 45 97
B1 B 1 28-Jan-19 5 22 52
B1 B 1 04-Feb-19 2 30 30
;
run ;
data want (drop = _:) ;
  /* First pass through the by-group: store each row's week and factor, and */
  /* add the current VALUE to every stored row whose window covers this week. */
  do _i = 1 by 1 until (last.location) ;
    set have ;
    by product location ;
    array wfs [3, 1000000] _temporary_ ;   /* row 1 = week, row 2 = factor, row 3 = running sum */
    wfs [1, _i] = week ;
    wfs [2, _i] = wkfactor ;
    do _j = 1 to _i ;
      if intck ("week", wfs [1, _j], week) < wfs [2, _j] then wfs [3, _j] + value ;
    end ;
  end ;
  /* Second pass: reread the same by-group and attach the accumulated sums. */
  do _i = 1 to _i ;
    set have ;
    desired_value = wfs [3, _i] ;
    wfs [3, _i] = . ;   /* reset the slot for the next by-group */
    output ;
  end ;
run ;
This code relies on the assumption that no (PRODUCT,LOCATION) by-group is larger than 1 million rows. Methinks it's pretty reasonable, and all you sacrifice system-wise is about 24 MB of memory (3 x 1,000,000 eight-byte temporary array elements). However, if you don't want to make any assumptions about the data, reasonable or not, you can eschew it by making an extra preliminary pass through the data set to size the upper bound of the array WFS:
proc sql noprint ;
  select max (q) into :hb
  from (select count (*) as q from have group by product, location) ;
quit ;
data want (drop = _:) ;
  do _i = 1 by 1 until (last.location) ;
    set have ;
    by product location ;
    array wfs [3, &hb] _temporary_ ;   /* second dimension sized from the preliminary pass */
    wfs [1, _i] = week ;
    wfs [2, _i] = wkfactor ;
    do _j = 1 to _i ;
      if intck ("week", wfs [1, _j], week) < wfs [2, _j] then wfs [3, _j] + value ;
    end ;
  end ;
  do _i = 1 to _i ;
    set have ;
    desired_value = wfs [3, _i] ;
    wfs [3, _i] = . ;
    output ;
  end ;
run ;
Yet another, even more dynamic approach is to replace the array with a hash table and thus avoid the extra pass through the data. If you would like to have a stab at this exercise, be my guest; and if you should find it a bit laborious, circle back and someone will surely help you figure it out. Personally, I think you're perfectly safe with the "1000000" assumption, and the solution using it is most likely the fastest time-wise.
Kind regards
Paul D.
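For reference, here is an untested sketch of the hash-based variant described above (not code from the thread). It assumes the same sorted input; the hash object simply replaces the temporary array, keyed by the row's sequence number within the by-group, and is cleared between groups:

data want (drop = _:) ;
  if _n_ = 1 then do ;
    dcl hash h () ;                      /* one entry per row of the current by-group */
    h.definekey ("_k") ;                 /* row sequence number within the group */
    h.definedata ("_w", "_f", "_s") ;    /* week, factor, running sum */
    h.definedone () ;
  end ;
  do _i = 1 by 1 until (last.location) ;
    set have ;
    by product location ;
    _k = _i ; _w = week ; _f = wkfactor ; _s = 0 ;
    _rc = h.add () ;
    do _j = 1 to _i ;                    /* roll this week's value into every stored */
      _k = _j ;                          /* row whose window still covers this week  */
      _rc = h.find () ;
      if intck ("week", _w, week) < _f then do ;
        _s + value ;
        _rc = h.replace () ;
      end ;
    end ;
  end ;
  do _i = 1 to _i ;                      /* second pass: attach the accumulated sums */
    set have ;
    _k = _i ;
    _rc = h.find () ;
    desired_value = _s ;
    output ;
  end ;
  _rc = h.clear () ;                     /* empty the table for the next by-group */
run ;

The per-row FIND/REPLACE calls make this somewhat slower than direct array lookups, which is consistent with the remark above that the fixed-size array version is likely the fastest.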
Do your numbers always vary between 2 and 5, or can they be anything? Do you have a license for SAS/ETS?
You can check what you have licensed in SAS using:
proc setinit; run;
And what's installed on your system using:
proc product_status; run;
The output will be in the log.
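For context: if WKFactor turned out to be constant (say, always 4) and SAS/ETS is licensed, PROC EXPAND could compute the forward moving sum directly. A rough sketch under that constant-window assumption (it also assumes weekly rows with no gaps, since MOVSUM is positional rather than calendar-aware):

/* Forward 4-week moving sum: REVERSE the series, take a backward MOVSUM of 4 */
/* (windows truncate at the start of the reversed series, which yields the    */
/* partial sums on the last rows of the original), then REVERSE back.         */
proc expand data=have out=want method=none ;
  by product location ;
  id week ;
  convert value = desired_sum / transformout=(reverse movsum 4 reverse) ;
run ;

With a per-row window like WKFactor, though, EXPAND does not apply directly, which is why the range of its values matters.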