Hello,
I need a cumulative sum over the next several weeks, where the window length depends on a factor variable. I can do this with SQL, but performance is poor and it does not run on large data sets.
Basically, for each unique product and location, I want the sum of Value over the next X weeks, where X comes from the column WKFactor. The original data does not have UniqueKey; I created it for the SQL code.
| UniqueKey | Product | Location | Week | WKFactor | Value | DesiredSum | Note |
|-----------|---------|----------|----------|----------|-------|------------|------|
| A1 | A | 1 | 7-Jan-19 | 4 | 43 | 94 | sum of Value over the 4-week window starting at the current week (per WKFactor) for Product A, Location 1: 43+2+27+22 = 94 |
| A1 | A | 1 | 14-Jan-19 | 5 | 2 | 76 | sum of Value over the next 5 weeks (per WKFactor) |
| A1 | A | 1 | 21-Jan-19 | 4 | 27 | 74 | sum of Value over the next 4 weeks (per WKFactor) |
| A1 | A | 1 | 28-Jan-19 | 4 | 22 | 47 | WKFactor is 4, but only 3 weeks remain in the data set, so only those 3 weeks are summed |
| A1 | A | 1 | 4-Feb-19 | 4 | 22 | 25 | |
| A1 | A | 1 | 11-Feb-19 | 4 | 3 | 3 | |
| A2 | A | 2 | 7-Jan-19 | 2 | 31 | 66 | |
| A2 | A | 2 | 14-Jan-19 | 2 | 35 | 76 | |
| A2 | A | 2 | 21-Jan-19 | 2 | 41 | 47 | |
| A2 | A | 2 | 28-Jan-19 | 2 | 6 | 6 | |
| B1 | B | 1 | 7-Jan-19 | 3 | 22 | 101 | |
| B1 | B | 1 | 14-Jan-19 | 2 | 34 | 79 | |
| B1 | B | 1 | 21-Jan-19 | 3 | 45 | 97 | |
| B1 | B | 1 | 28-Jan-19 | 5 | 22 | 52 | |
| B1 | B | 1 | 4-Feb-19 | 2 | 30 | 30 | |
This is the SQL code that gives me the desired output, but the correlated subquery (effectively a self-join) takes too long on such a large data set. Any better ways?
proc sql ;
  create table work.desire as
  select uniquekey, product, location, week, wkfactor, value,
         (select sum(value) from work.have
          where uniquekey = a.uniquekey
            and week >= a.week
            and week <= intnx('week', a.week, a.wkfactor)) as desiredsum
  from work.have as a ;
quit ;
@lmg:
Since you have the data sorted by the (PRODUCT,LOCATION) composite key, there's no need to create UNIQUEKEY: the processing can rely on the composite key alone, and it looks like a straightforward DoW-loop/array job:
data have ;
  input (UniqueKey Product) (:$2.) Location Week :date. WKFactor Value ;
  /* the 7th field on each card is the expected sum; list input ignores it */
  format week yymmdd10. ;
  cards ;
A1 A 1 07-Jan-19 4 43 94
A1 A 1 14-Jan-19 5 2 76
A1 A 1 21-Jan-19 4 27 74
A1 A 1 28-Jan-19 4 22 47
A1 A 1 04-Feb-19 4 22 25
A1 A 1 11-Feb-19 4 3 3
A2 A 2 07-Jan-19 2 31 66
A2 A 2 14-Jan-19 2 35 76
A2 A 2 21-Jan-19 2 41 47
A2 A 2 28-Jan-19 2 6 6
B1 B 1 07-Jan-19 3 22 101
B1 B 1 14-Jan-19 2 34 79
B1 B 1 21-Jan-19 3 45 97
B1 B 1 28-Jan-19 5 22 52
B1 B 1 04-Feb-19 2 30 30
;
run ;
data want (drop = _:) ;
  /* First pass through the by-group: store each row's week and factor, and */
  /* add the current VALUE to every stored row whose window covers this week. */
  do _i = 1 by 1 until (last.location) ;
    set have ;
    by product location ;
    array wfs [3, 1000000] _temporary_ ;   /* row 1 = week, row 2 = factor, row 3 = running sum */
    wfs [1, _i] = week ;
    wfs [2, _i] = wkfactor ;
    do _j = 1 to _i ;
      if intck ("week", wfs [1, _j], week) < wfs [2, _j] then wfs [3, _j] + value ;
    end ;
  end ;
  /* Second pass: reread the same by-group and attach the accumulated sums. */
  do _i = 1 to _i ;
    set have ;
    desired_value = wfs [3, _i] ;
    wfs [3, _i] = . ;   /* reset the slot for the next by-group */
    output ;
  end ;
run ;
This code relies on the assumption that no (PRODUCT,LOCATION) by-group is larger than 1 million rows. Methinks it's pretty reasonable, and all you sacrifice system-wise is about 24 MB of memory (3 x 1,000,000 eight-byte temporary array elements). However, if you don't want to make any assumptions about the data, reasonable or not, you can eschew it by making an extra preliminary pass through the data set to size the upper bound of the array WFS:
proc sql noprint ;
  select max (q) into :hb
  from (select count (*) as q from have group by product, location) ;
quit ;
data want (drop = _:) ;
  do _i = 1 by 1 until (last.location) ;
    set have ;
    by product location ;
    array wfs [3, &hb] _temporary_ ;   /* second dimension sized from the preliminary pass */
    wfs [1, _i] = week ;
    wfs [2, _i] = wkfactor ;
    do _j = 1 to _i ;
      if intck ("week", wfs [1, _j], week) < wfs [2, _j] then wfs [3, _j] + value ;
    end ;
  end ;
  do _i = 1 to _i ;
    set have ;
    desired_value = wfs [3, _i] ;
    wfs [3, _i] = . ;
    output ;
  end ;
run ;
Yet another, even more dynamic approach is to replace the array with a hash table and thus avoid the extra pass through the data. If you would like to have a stab at this exercise, be my guest; and if you should find it a bit laborious, circle back and someone will surely help you figure it out. Personally, I think you're perfectly safe with the "1000000" assumption, and the solution using it is most likely the fastest time-wise.
Kind regards
Paul D.
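For reference, here is an untested sketch of the hash-based variant described above (not code from the thread). It assumes the same sorted input; the hash object simply replaces the temporary array, keyed by the row's sequence number within the by-group, and is cleared between groups:

data want (drop = _:) ;
  if _n_ = 1 then do ;
    dcl hash h () ;                      /* one entry per row of the current by-group */
    h.definekey ("_k") ;                 /* row sequence number within the group */
    h.definedata ("_w", "_f", "_s") ;    /* week, factor, running sum */
    h.definedone () ;
  end ;
  do _i = 1 by 1 until (last.location) ;
    set have ;
    by product location ;
    _k = _i ; _w = week ; _f = wkfactor ; _s = 0 ;
    _rc = h.add () ;
    do _j = 1 to _i ;                    /* roll this week's value into every stored */
      _k = _j ;                          /* row whose window still covers this week  */
      _rc = h.find () ;
      if intck ("week", _w, week) < _f then do ;
        _s + value ;
        _rc = h.replace () ;
      end ;
    end ;
  end ;
  do _i = 1 to _i ;                      /* second pass: attach the accumulated sums */
    set have ;
    _k = _i ;
    _rc = h.find () ;
    desired_value = _s ;
    output ;
  end ;
  _rc = h.clear () ;                     /* empty the table for the next by-group */
run ;

The per-row FIND/REPLACE calls make this somewhat slower than direct array lookups, which is consistent with the remark above that the fixed-size array version is likely the fastest.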
Do your numbers always vary between 2 and 5, or can they be anything? Do you have a license for SAS/ETS?
You can check what you have licensed in SAS using:
proc setinit; run;
And what's installed on your system using:
proc product_status; run;
The output will be in the log.
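For context: if WKFactor turned out to be constant (say, always 4) and SAS/ETS is licensed, PROC EXPAND could compute the forward moving sum directly. A rough sketch under that constant-window assumption (it also assumes weekly rows with no gaps, since MOVSUM is positional rather than calendar-aware):

/* Forward 4-week moving sum: REVERSE the series, take a backward MOVSUM of 4 */
/* (windows truncate at the start of the reversed series, which yields the    */
/* partial sums on the last rows of the original), then REVERSE back.         */
proc expand data=have out=want method=none ;
  by product location ;
  id week ;
  convert value = desired_sum / transformout=(reverse movsum 4 reverse) ;
run ;

With a per-row window like WKFactor, though, EXPAND does not apply directly, which is why the range of its values matters.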