Solved: finding the sum over intervals of time

peatjohnston · Posted 02-13-2012 11:55 AM

Hello all,

I have pulled data from various sorces and merged it into the format below. I want to calculate total precipitation (var=precip) over different time intervals (e.g 2 weeks, 6 weeks, 9 months, etc...) before the reference point displayed by the variables 'X1' and 'X2' (I have more than just these two variables). These reference points are sites that were visited repeatedly for other measurements; the reference points occur several times for each variable several times a year over several years. My original thought was to create a new variable for each metric (2wkprecp, 6wkprecp) and each reference point and use proc means to take sums over those assinged intervals but I think I am heading down the wrong path.

Obs date X1 X2 precip date1

1 02AUG1994 8 19940802

2 03AUG1994 56 19940803

3 04AUG1994 18 19940804

4 05AUG1994 15 19940805

5 06AUG1994 36 19940806

6 07AUG1994 0 19940807

7 08AUG1994 0 19940808

8 09AUG1994 0 19940809

9 10AUG1994 0 19940810

10 11AUG1994 76 19940811

11 12AUG1994 0 19940812

12 13AUG1994 33 19940813

13 14AUG1994 79 19940814

14 15AUG1994 0 19940815

15 16AUG1994 13 19940816

16 17AUG1994 5 19940817

17 18AUG1994 0 19940818

18 19AUG1994 X2 61 19940819

19 20AUG1994 0 19940820

20 21AUG1994 33 19940821

21 22AUG1994 231 19940822

22 23AUG1994 X1 0 19940823

23 24AUG1994 0 19940824

24 25AUG1994 5 19940825

In the end I would like to have a dataset in the form (~ 100 records):

date	refpoint	2wkprecp	6wkprecp	9moprecp
19-Aug-94	X2	336	789	2120
23-Aug-94	X1	531	1053	2201
17-Sep-97	X1	344	718	1938
18-Sep-97	X2	344	718	1905

It should be noted that sometimes X1 and X2 (X3, X4, etc...) are on the same dates.

Any suggestions are welcome.

Thanks in advance.

PGStats · Posted 02-13-2012 02:42 PM

Try this:

data have;
format d date11.;
input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;
datalines;
02AUG1994                    8
03AUG1994                   56
04AUG1994                   18
05AUG1994                   15
06AUG1994                   36
07AUG1994                    0
08AUG1994                    0
09AUG1994                    0
10AUG1994                    0
11AUG1994                   76
12AUG1994                    0
13AUG1994                   33
14AUG1994                   79
15AUG1994                    0
16AUG1994                   13
17AUG1994                    5
18AUG1994                    0
19AUG1994          X2       61
20AUG1994                    0
21AUG1994                   33
22AUG1994                  231
23AUG1994    X1              0
24AUG1994                    0
25AUG1994                    5
;

proc transpose data=have
out=refs(rename=(d=refDate col1=station) where=(notspace(station)>0) drop=_:);
by d notsorted;
var x:;
run;

data intervals;
input interval $;
datalines;
WEEK2
WEEK6
MONTH9
;

proc sql;
create table want as
select station, refDate, interval, sum(precip) as totalPrecip
from have, refs, intervals
where d between intnx(interval, refDate-1, -1, "SAME") and refDate
group by station, refDate, interval;

PG

View solution in original post

Astounding · Posted 02-13-2012 01:26 PM

peat,

There are people out there whose SQL is much better than mine, but here are questions that anyone who replies will need to know.

What does DATE actually contain? Is it a SAS date, but being printed in a DATE9 format, or is it a character string?

What does DATE1 actually contain? Is it a numeric being printed without a format, or is it a SAS date being printed in the YYMMDDn8 format?

Good luck.

peatjohnston · Posted 02-13-2012 01:53 PM

Thanks for asking for the clarification.

DATE is a SAS date in DATE9 format. The DATE1 is just a number. Originally I was thinking that I could reference a range based on DATE1. I am willing to format these variables to any format necessary.

Thanks

Astounding · Posted 02-13-2012 02:21 PM

OK, here's an attempt. Like I said, I expect someone who knows SQL much better will fix this and make it look more professional.

I'm going to assume you have already replaced DATE1 with the equivalent SAS date, something along the lines of:

date1 = input( put(date1, 8.), yymmdd8.);

This could then be a piece of the solution, creating _2wkprecip for the X1 records. Please adjust the date ranges using an equal sign (<= or >=), as appropriate. (You may also need to do a little debugging ... I tested pieces of the logic but not the whole thing.)

proc sql;

create table x1 as select distinct x1, date from original

where x1='X1';

create table to_be_summed as select a.x1, a.date, b.precip

from x1 as a, original as b

where (b.date1 < a.date) and (b.date1 + 14 > a.date);

create table x1_2weeks as select x1, date, sum(precip) as _2wkprecip from to_be_summed

group by date;

quit;

Until someone "fixes" the code, you would need a similar group of statements for each time period, and for X2, X3, etc.

There remains the final task of putting the pieces together. But this gives you a starting point.

Good luck.

PGStats · Posted 02-13-2012 02:42 PM

Try this:

data have;
format d date11.;
input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;
datalines;
02AUG1994                    8
03AUG1994                   56
04AUG1994                   18
05AUG1994                   15
06AUG1994                   36
07AUG1994                    0
08AUG1994                    0
09AUG1994                    0
10AUG1994                    0
11AUG1994                   76
12AUG1994                    0
13AUG1994                   33
14AUG1994                   79
15AUG1994                    0
16AUG1994                   13
17AUG1994                    5
18AUG1994                    0
19AUG1994          X2       61
20AUG1994                    0
21AUG1994                   33
22AUG1994                  231
23AUG1994    X1              0
24AUG1994                    0
25AUG1994                    5
;

proc transpose data=have
out=refs(rename=(d=refDate col1=station) where=(notspace(station)>0) drop=_:);
by d notsorted;
var x:;
run;

data intervals;
input interval $;
datalines;
WEEK2
WEEK6
MONTH9
;

proc sql;
create table want as
select station, refDate, interval, sum(precip) as totalPrecip
from have, refs, intervals
where d between intnx(interval, refDate-1, -1, "SAME") and refDate
group by station, refDate, interval;

PG

PGStats · Posted 02-13-2012 02:56 PM

After which you may want to add :

proc transpose data=want out=wantTable(drop=_:);

var totalPrecip;

by station refDate;

id interval;

run;

to format the table.

Good luck.

PG

art297 · Posted 02-13-2012 03:00 PM

PGStats: I'm responding, primarily, to say "Nice Solution!"

peatjohnston The only thing I'll add to PGStats' suggested code is to insure that you add a "quit;" statement at the end of the proc sql code.

peatjohnston · Posted 02-13-2012 06:30 PM

This was most helpful and it looks like I need to start learning the SQL Procedure. I can see that you put three 'tables' together and it appears that proc sql understands the temporal values (WEEK6, etc...). I will read up but there are a few things I would like to follow up on.

In the data step:

input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;

- I by pased this portion of the code since my date was already formated as a SAS date. But my question is why did you specify the spacings for each variable, is that style or does proc sql want strict spacing at times?

In proc sql:

where d between intnx(interval, refDate-1, -1, "SAME") and refDate

- This ended up including an extra day in the summation and included the refDate itself in the summation (which was not something I discussed). I modified the code to:

where d between intnx(interval, refDate, -1, "SAME") and refDate-1

This gave me the exact number of days up to but not including the sample date(WEEK2 = 14 days before the reDate not including the reDate). With the sample date aside, was there a reason for using refDate-1 in the expression? as it appears that this increases the time interval by 1.

Also, adding the ':' at the end of a variable or unwanted character was not a trick I learned. Is there a reference somewhere where I can find more of the basic programming syntax? I am pretty much self taught and things like this make it a lot easier for me. Thanks again, I have learned a lot from your answer.

PGStats · Posted 02-14-2012 11:05 AM

Learning SQL is costly but a good investment as it can be used with most DBMSs. SAS implementation of SQL is very powerful because it allows almost all of SAS functions and can interact with macro variables. I also find it very efficient. A book on relational databases is probably a better place to start learning SQL than SAS Doc.

The details of the first data step are not important and probably not very optimal as I almost never input text data anymore.

I had shifted the time window for the summation in order to match the 336 value from your example. I wasn't sure what you wanted exactly. In SAS, date values are in units of days so that refDate+1 is the day after refDate. But it is best to stay independent of the internal date representation. So, ideally, your where condition should read: where d between intnx(interval, refDate, -1, "SAME") and intnx("DAY", refDate, -1, "SAME").

Time intervals such as WEEK2 are understood by the intnx function. I have to go back to the documentation every time I use it.

I learned the colon trick the same way you did just a few days ago, on this forum. But it is documented, at least in version 9.3, along with many other variable list notations, it neans all the variables that start with.

PG

Ksharp · Posted 02-14-2012 03:37 AM

How about:

data have;
format d date11.;
input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;
datalines;
02AUG1994                    8
03AUG1994                   56
04AUG1994                   18
05AUG1994                   15
06AUG1994                   36
07AUG1994                    0
08AUG1994                    0
09AUG1994                    0
10AUG1994                    0
11AUG1994                   76
12AUG1994                    0
13AUG1994                   33
14AUG1994                   79
15AUG1994                    0
16AUG1994                   13
17AUG1994                    5
18AUG1994                    0
19AUG1994          X2       61
20AUG1994                    0
21AUG1994                   33
22AUG1994                  231
23AUG1994    X1              0
24AUG1994                    0
25AUG1994                    5
;
run;
proc sql feedback;
create table want as
 select d,x1,x2,(select sum(precip) from have  where d between intnx('week2',a.d,-1,'s') and a.d ) as _2wkprecp,
                (select sum(precip) from have  where d between intnx('week6',a.d,-1,'s') and a.d ) as _6wkprecp,
                (select sum(precip) from have  where d between intnx('month9',a.d,-1,'s') and a.d ) as _9moprecp
  from have as a
   where x1 is not missing or x2 is not missing;
quit;

Ksharp

peatjohnston · Posted 02-14-2012 09:13 AM

Hello Ksharp,

I did not run this, but I will look at it later and study it as an alternative as I start learning more about proc sql. My question/comment is with the WHERE statement in the second to last line:

where x1 is not missing or x2 is not missing;

If there is an X1 and X2 ocurring on the same date, would it only select one variable, but not both?

Thanks for the insight.

Peat

Ksharp · Posted 02-15-2012 03:08 AM

Hi. They will select both x1 and x2, because I have wroten x1,x2 in select clause.

Ksharp

peatjohnston · Posted 02-15-2012 11:10 AM

Comparison of methods:

The method by Ksharp appears to be more direct as the run time is much faster (0.38 sec compated to the method from PGStats = 4.25 sec) and also transposes easily into the form I am looking for. One thing that I did not intend to discover from this post is that Kshap's method also works if the X1, X2... VALUES are numbers instead of text place holders as I presented in this post. As it turns out I have a separate set of data that I want to perform a similar operation on and then make calculations based on observed numbers to these metrics calculated by proc sql and in this way the data is already without transposing. I just have to add another 'or X is not missing' to the statement for each of the variables (12 in all) which is just fine with me.

Thank you both, this has turned out to be highly educational for me.

Peat

Register Today!

SAS Training: Just a Click Away