I have a dataset that looks like this:
Obs | ID | DATE | … | Jan-03 | Feb-03 | Mar-03 | Apr-03 | May-03 | Jun-03 | … | 1 year | 5 years |
1 | 11 | June 2003 | 0 | 0 | 0.23858 | 109.357 | 0 | 0 | ||||
2 | 22 | May 2003 | 1.4 | 1 | 0 | 0 | 11 | 0 |
I need to calculate the last two columns based on the DATE variable. So if the DATE = June 2003 then 1 Year = sum of values across June 2002 to June 2003. If DATE = May 2003 then 1 Year = sum of values across May 2002 to May 2003.
Please help! I think you can do this with arrays but I am not very familiar using arrays.
Thanks.
data shell; infile datalines dsd truncover; input f:$6. Date:MMDDYY10. Jan2003:BEST12. Feb2003:BEST12. Mar2003:BEST12. Apr2003:BEST12. May2003:BEST12. Jun2003:BEST12. Jul2003:BEST12. Aug2003:BEST12. Sep2003:BEST12. Oct2003:BEST12. Nov2003:BEST12. Dec2003:BEST12. Jan2004:BEST12. Feb2004:BEST12. Mar2004:BEST12. Apr2004:BEST12. May2004:BEST12. Jun2004:BEST12. Jul2004:BEST12. Aug2004:BEST12. Sep2004:BEST12. Oct2004:BEST12. Nov2004:BEST12. Dec2004:BEST12. Jan2005:BEST12. Feb2005:BEST12. Mar2005:BEST12. Apr2005:BEST12. May2005:BEST12. Jun2005:BEST12. Jul2005:BEST12. Aug2005:BEST12. Sep2005:BEST12. Oct2005:BEST12. Nov2005:BEST12. Dec2005:BEST12. Jan2006:BEST12. Feb2006:BEST12. Mar2006:BEST12. Apr2006:BEST12. May2006:BEST12. Jun2006:BEST12. Jul2006:BEST12. Aug2006:BEST12. Sep2006:BEST12. Oct2006:BEST12. Nov2006:BEST12. Dec2006:BEST12. Jan2007:BEST12. Feb2007:BEST12. Mar2007:BEST12. Apr2007:BEST12. May2007:BEST12. Jun2007:BEST12. Jul2007:BEST12. Aug2007:BEST12. Sep2007:BEST12. Oct2007:BEST12. Nov2007:BEST12. Dec2007:BEST12. Jan2004_yr:32. Feb2004_yr:32.; format date monyy7.; datalines4; 01001, 060107, 0,0,0.23858,109.3571,0,0,0.23858,0.08946,0,0,0,0,0,0,0,1.3312,0,0,0.1133,0,73.86797,0,26.34095,0,0,0,0.37603,10.72756,0.10744,0.42847,5.34498,24.03554,0,0,0.91321,0,0,0.0761,0,0.1522,0.25366,0,0.02536,0.05074,0,0,0,0,0,0.31592,0,0.0486,0,0,0,0,0,0.0486,0,0,0,0 01003, 072106, 0,0.04498,0.11694,0.04498,4.54283,0,0.04498,0.06296,0,0,0.08996,0,0,0,0,0,0,0.12736,0.04246,0,1768.785,0,5.47608,0,0.11864,0,0.15819,2.12758,0.0791,1.3182,119.2311,1977.33,0,0,0,0,0.07392,0,0,0,0.18479,0.22174,0.07392,0.91654,0,0.48044,1.18263,0,0,0,0,0,0.45552,0.12614,1.96222,0.07008,0,0.70079,0,0,0,0.04498 01005, 050406, 0,0,0,0.37034,0,0,0.37034,0,0,0,0,0,0,0,0,0,0,0,0,0,22.92384,0,0,0,0,0,2.35742,1.92085,0,0,1.17871,39.32827,0,0,0,0.26193,0,0,0,0,0.12642,0.50572,0,0.42142,0,0,0.63214,0,10.1819,0,0,0,0,0,0.08146,0,0,0,0,0,0,0 01007, 051207, 0.19462,0,0.12974,16.218,4.8654,0,0,0,0,0,0,0,0,0,0,0,4.38758,0.75216,0,0,22.06327,0,0,0,0.2406,0,0,0.30075,0,0.2979,3.36851,53.82838,0,0,0,0,0,0,0,0,0,0.58518,0,0,0,0,0.23408,0,0,0.84553,0,0.56368,0,1.12736,0,0,0,0.67642,0,0,0.19462,0 ;;;; data want; set shell; array x{*} Jan2003--Dec2007; do i=-12 to 0; temp=put(intnx('month',date,i,'s'),monyy7.); one_year=sum(one_year,vvaluex(temp)); end; *keep f date one_year; drop i temp; run;
Arrays could work, although a better approach would be to work with a long data set instead of a wide data set.
But you have to help us first. Provide a portion of your data following these instructions and not via any other method. We can't work with screen captures, and we can't work with fake data sets like the above which don't have legal SAS variable names.
Is this better? I am looking to create new columns titled "1_Year" and "5_Years" by creating sums based on the condition such that if the DATE = June 2003 then 1 Year = sum of values across June 2002 to June 2003. If DATE = May 2003 then 1 Year = sum of values across May 2002 to May 2003.
data shell; infile datalines dsd truncover; input f:$6. Date:MMDDYY10. Jan2003:BEST12. Feb2003:BEST12. Mar2003:BEST12. Apr2003:BEST12. May2003:BEST12. Jun2003:BEST12. Jul2003:BEST12. Aug2003:BEST12. Sep2003:BEST12. Oct2003:BEST12. Nov2003:BEST12. Dec2003:BEST12. Jan2004:BEST12. Feb2004:BEST12. Mar2004:BEST12. Apr2004:BEST12. May2004:BEST12. Jun2004:BEST12. Jul2004:BEST12. Aug2004:BEST12. Sep2004:BEST12. Oct2004:BEST12. Nov2004:BEST12. Dec2004:BEST12. Jan2005:BEST12. Feb2005:BEST12. Mar2005:BEST12. Apr2005:BEST12. May2005:BEST12. Jun2005:BEST12. Jul2005:BEST12. Aug2005:BEST12. Sep2005:BEST12. Oct2005:BEST12. Nov2005:BEST12. Dec2005:BEST12. Jan2006:BEST12. Feb2006:BEST12. Mar2006:BEST12. Apr2006:BEST12. May2006:BEST12. Jun2006:BEST12. Jul2006:BEST12. Aug2006:BEST12. Sep2006:BEST12. Oct2006:BEST12. Nov2006:BEST12. Dec2006:BEST12. Jan2007:BEST12. Feb2007:BEST12. Mar2007:BEST12. Apr2007:BEST12. May2007:BEST12. Jun2007:BEST12. Jul2007:BEST12. Aug2007:BEST12. Sep2007:BEST12. Oct2007:BEST12. Nov2007:BEST12. Dec2007:BEST12. Jan2004_yr:32. Feb2004_yr:32.; datalines4; 01001, 060107, 0,0,0.23858,109.3571,0,0,0.23858,0.08946,0,0,0,0,0,0,0,1.3312,0,0,0.1133,0,73.86797,0,26.34095,0,0,0,0.37603,10.72756,0.10744,0.42847,5.34498,24.03554,0,0,0.91321,0,0,0.0761,0,0.1522,0.25366,0,0.02536,0.05074,0,0,0,0,0,0.31592,0,0.0486,0,0,0,0,0,0.0486,0,0,0,0 01003, 072106, 0,0.04498,0.11694,0.04498,4.54283,0,0.04498,0.06296,0,0,0.08996,0,0,0,0,0,0,0.12736,0.04246,0,1768.785,0,5.47608,0,0.11864,0,0.15819,2.12758,0.0791,1.3182,119.2311,1977.33,0,0,0,0,0.07392,0,0,0,0.18479,0.22174,0.07392,0.91654,0,0.48044,1.18263,0,0,0,0,0,0.45552,0.12614,1.96222,0.07008,0,0.70079,0,0,0,0.04498 01005, 050406, 0,0,0,0.37034,0,0,0.37034,0,0,0,0,0,0,0,0,0,0,0,0,0,22.92384,0,0,0,0,0,2.35742,1.92085,0,0,1.17871,39.32827,0,0,0,0.26193,0,0,0,0,0.12642,0.50572,0,0.42142,0,0,0.63214,0,10.1819,0,0,0,0,0,0.08146,0,0,0,0,0,0,0 01007, 051207, 0.19462,0,0.12974,16.218,4.8654,0,0,0,0,0,0,0,0,0,0,0,4.38758,0.75216,0,0,22.06327,0,0,0,0.2406,0,0,0.30075,0,0.2979,3.36851,53.82838,0,0,0,0,0,0,0,0,0,0.58518,0,0,0,0,0.23408,0,0,0.84553,0,0.56368,0,1.12736,0,0,0,0.67642,0,0,0.19462,0 01009, 072108, 0,0,0.27941,1.32087,1.62569,0.20322,0.20322,0,0,0,0.127,0,0,2.14764,0,0,0,0,0.61012,0,2.0012,0.82977,0.31726,0,0.44388,0,0.98122,0.63078,0.60742,0.62967,1.612,20.90636,0,0,0.09344,0,0,0,0.28926,0.089,0,0.22252,0.06676,0.06676,34.48916,0,0,0,0,0,0,0,0,0,0.25557,0.7454,0,0,0,0.03276,0,2.14764 ;;;; proc print data=shell; format Date date9.; run;
First, thank you for providing data in the proper format. This makes everyone's job easier now.
The idea of using variable names that contain data (in this case month and year) makes me very uncomfortable, as it makes the coding much more difficult. SAS does not know that a variable name of Jan2003 comes before a variable name of Apr2003, as the variable names can only be sorted alphabetically, and it doesn't know what month comes before Jan2004.
You need a long data set, not a wide data set, and then the programming becomes easy, as you can use the built in SAS date functions and now SAS knows that before Jan2004 it was Dec2003. See Maxim 19. So, let me create a long data set to replace your wide data set.
data long;
set shell;
array months jan2003--dec2007;
do i=1 to dim(months);
month = input(vname(months(i)),monyy7.);
value=months(i);
output;
end;
format month date9.; /* This line is optional, the program works without it */
keep f date month value;
run;
data long2;
set long;
/* Next line determines if we are in the 12 months before value in variable date */
prev_12_month = 1<=intck('month',month,date)<=12;
run;
/* Obtain sums */
proc summary data=long2 nway;
class f;
var value/weight=prev_12_month;
output out=want sum=;
run;
If you, from now on, start with a long data set instead of a wide data set, you will be much better off and calculations are simple. SAS knows that Dec 2003 comes one month before Jan 2004. That's done in the INTCK function. So if you start with a long data set in this case, your programming would begin where I have data long2; in the code, and isn't that much simpler than arrays?
I point out one additional benefit of long instead of wide data sets. If you get another data set with different months (in the wide data set this means different variable names), the code would have to be modified. But in the long data set case, there's no change needed in the code at all.
Finally, your original problem statement seems to imply you want 13 months in the sum. That's what you said:
... by creating sums based on the condition such that if the DATE = June 2003 then 1 Year = sum of values across June 2002 to June 2003.
If you really want 13 months instead of the 12 that I programmed, I leave that to you to fix in the above code as a homework assignment.
data shell; infile datalines dsd truncover; input f:$6. Date:MMDDYY10. Jan2003:BEST12. Feb2003:BEST12. Mar2003:BEST12. Apr2003:BEST12. May2003:BEST12. Jun2003:BEST12. Jul2003:BEST12. Aug2003:BEST12. Sep2003:BEST12. Oct2003:BEST12. Nov2003:BEST12. Dec2003:BEST12. Jan2004:BEST12. Feb2004:BEST12. Mar2004:BEST12. Apr2004:BEST12. May2004:BEST12. Jun2004:BEST12. Jul2004:BEST12. Aug2004:BEST12. Sep2004:BEST12. Oct2004:BEST12. Nov2004:BEST12. Dec2004:BEST12. Jan2005:BEST12. Feb2005:BEST12. Mar2005:BEST12. Apr2005:BEST12. May2005:BEST12. Jun2005:BEST12. Jul2005:BEST12. Aug2005:BEST12. Sep2005:BEST12. Oct2005:BEST12. Nov2005:BEST12. Dec2005:BEST12. Jan2006:BEST12. Feb2006:BEST12. Mar2006:BEST12. Apr2006:BEST12. May2006:BEST12. Jun2006:BEST12. Jul2006:BEST12. Aug2006:BEST12. Sep2006:BEST12. Oct2006:BEST12. Nov2006:BEST12. Dec2006:BEST12. Jan2007:BEST12. Feb2007:BEST12. Mar2007:BEST12. Apr2007:BEST12. May2007:BEST12. Jun2007:BEST12. Jul2007:BEST12. Aug2007:BEST12. Sep2007:BEST12. Oct2007:BEST12. Nov2007:BEST12. Dec2007:BEST12. Jan2004_yr:32. Feb2004_yr:32.; format date monyy7.; datalines4; 01001, 060107, 0,0,0.23858,109.3571,0,0,0.23858,0.08946,0,0,0,0,0,0,0,1.3312,0,0,0.1133,0,73.86797,0,26.34095,0,0,0,0.37603,10.72756,0.10744,0.42847,5.34498,24.03554,0,0,0.91321,0,0,0.0761,0,0.1522,0.25366,0,0.02536,0.05074,0,0,0,0,0,0.31592,0,0.0486,0,0,0,0,0,0.0486,0,0,0,0 01003, 072106, 0,0.04498,0.11694,0.04498,4.54283,0,0.04498,0.06296,0,0,0.08996,0,0,0,0,0,0,0.12736,0.04246,0,1768.785,0,5.47608,0,0.11864,0,0.15819,2.12758,0.0791,1.3182,119.2311,1977.33,0,0,0,0,0.07392,0,0,0,0.18479,0.22174,0.07392,0.91654,0,0.48044,1.18263,0,0,0,0,0,0.45552,0.12614,1.96222,0.07008,0,0.70079,0,0,0,0.04498 01005, 050406, 0,0,0,0.37034,0,0,0.37034,0,0,0,0,0,0,0,0,0,0,0,0,0,22.92384,0,0,0,0,0,2.35742,1.92085,0,0,1.17871,39.32827,0,0,0,0.26193,0,0,0,0,0.12642,0.50572,0,0.42142,0,0,0.63214,0,10.1819,0,0,0,0,0,0.08146,0,0,0,0,0,0,0 01007, 051207, 0.19462,0,0.12974,16.218,4.8654,0,0,0,0,0,0,0,0,0,0,0,4.38758,0.75216,0,0,22.06327,0,0,0,0.2406,0,0,0.30075,0,0.2979,3.36851,53.82838,0,0,0,0,0,0,0,0,0,0.58518,0,0,0,0,0.23408,0,0,0.84553,0,0.56368,0,1.12736,0,0,0,0.67642,0,0,0.19462,0 ;;;; data want; set shell; array x{*} Jan2003--Dec2007; do i=-12 to 0; temp=put(intnx('month',date,i,'s'),monyy7.); one_year=sum(one_year,vvaluex(temp)); end; *keep f date one_year; drop i temp; run;
This is really clever code, @Ksharp , thanks!
In real life, there might be cases where 12 months earlier is not in the array. Also if this program is to be run on a regular basis and the months in the data set change from what they are now to some other set of months, the code needs to be dynamic. Starting with a long data set overcomes both of these problems and runs without human intervention.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.