DATA Step, Macro, Functions and more

finding the sum over intervals of time

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 17
Accepted Solution

finding the sum over intervals of time

Hello all,

I have pulled data from various sorces and merged it into the format below.  I want to calculate total precipitation (var=precip) over different time intervals (e.g 2 weeks, 6 weeks, 9 months, etc...) before the reference point displayed by the variables 'X1' and 'X2' (I have more than just these two variables).  These reference points are sites that were visited repeatedly for other measurements; the reference points occur several times for each variable several times a year over several years.  My original thought was to create a new variable for each metric (2wkprecp, 6wkprecp) and each reference point and use proc means to take sums over those assinged intervals but I think I am heading down the wrong path.

                             Obs         date    X1    X2    precip      date1

                          1    02AUG1994                    8     19940802

                          2    03AUG1994                   56     19940803

                          3    04AUG1994                   18     19940804

                          4    05AUG1994                   15     19940805

                          5    06AUG1994                   36     19940806

                          6    07AUG1994                    0     19940807

                          7    08AUG1994                    0     19940808

                          8    09AUG1994                    0     19940809

                          9    10AUG1994                    0     19940810

                         10    11AUG1994                   76     19940811

                         11    12AUG1994                    0     19940812

                         12    13AUG1994                   33     19940813

                         13    14AUG1994                   79     19940814

                         14    15AUG1994                    0     19940815

                         15    16AUG1994                   13     19940816

                         16    17AUG1994                    5     19940817

                         17    18AUG1994                    0     19940818

                         18    19AUG1994          X2       61     19940819

                         19    20AUG1994                    0     19940820

                         20    21AUG1994                   33     19940821

                         21    22AUG1994                  231     19940822

                         22    23AUG1994    X1              0     19940823

                         23    24AUG1994                    0     19940824

                         24    25AUG1994                    5     19940825

In the end I would like to have a dataset in the form (~ 100 records):

daterefpoint2wkprecp6wkprecp9moprecp
19-Aug-94   X23367892120
23-Aug-94   X153110532201
17-Sep-97   X13447181938
18-Sep-97  X23447181905

It should be noted that sometimes X1 and X2 (X3, X4, etc...) are on the same dates.

Any suggestions are welcome.

Thanks in advance.


Accepted Solutions
Solution
‎02-13-2012 02:42 PM
Respected Advisor
Posts: 4,918

finding the sum over intervals of time

Posted in reply to peatjohnston

Try this:

data have;
format d date11.;
input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;
datalines;
02AUG1994                    8
03AUG1994                   56
04AUG1994                   18
05AUG1994                   15
06AUG1994                   36
07AUG1994                    0
08AUG1994                    0
09AUG1994                    0
10AUG1994                    0
11AUG1994                   76
12AUG1994                    0
13AUG1994                   33
14AUG1994                   79
15AUG1994                    0
16AUG1994                   13
17AUG1994                    5
18AUG1994                    0
19AUG1994          X2       61
20AUG1994                    0
21AUG1994                   33
22AUG1994                  231
23AUG1994    X1              0
24AUG1994                    0
25AUG1994                    5
;

proc transpose data=have
out=refs(rename=(d=refDate col1=station) where=(notspace(station)>0) drop=_Smiley Happy;
by d notsorted;
var x:;
run;

data intervals;
input interval $;
datalines;
WEEK2
WEEK6
MONTH9
;

proc sql;
create table want as
select station, refDate, interval, sum(precip) as totalPrecip
from have, refs, intervals
where d between intnx(interval, refDate-1, -1, "SAME") and refDate
group by station, refDate, interval;

PG

View solution in original post


All Replies
Super User
Posts: 5,495

finding the sum over intervals of time

Posted in reply to peatjohnston

peat,

There are people out there whose SQL is much better than mine, but here are questions that anyone who replies will need to know.

What does DATE actually contain?  Is it a SAS date, but being printed in a DATE9 format, or is it a character string?

What does DATE1 actually contain?  Is it a numeric being printed without a format, or is it a SAS date being printed in the YYMMDDn8 format?

Good luck.

Occasional Contributor
Posts: 17

finding the sum over intervals of time

Posted in reply to Astounding

Thanks for asking for the clarification.

DATE is a SAS date in DATE9 format.  The DATE1 is just a number.  Originally I was thinking that I could reference a range based on DATE1.  I am willing to format these variables to any format necessary.

Thanks

Super User
Posts: 5,495

finding the sum over intervals of time

Posted in reply to peatjohnston

OK, here's an attempt.  Like I said, I expect someone who knows SQL much better will fix this and make it look more professional.

I'm going to assume you have already replaced DATE1 with the equivalent SAS date, something along the lines of:

date1 = input( put(date1, 8.), yymmdd8.);

This could then be a piece of the solution, creating _2wkprecip for the X1 records.  Please adjust the date ranges using an equal sign (<= or >=), as appropriate.  (You may also need to do a little debugging ... I tested pieces of the logic but not the whole thing.)

proc sql;

   create table x1 as select distinct x1, date from original

           where x1='X1';

   create table to_be_summed as select a.x1, a.date, b.precip

           from x1 as a, original as b

           where (b.date1 < a.date) and (b.date1 + 14 > a.date);

   create table x1_2weeks as select x1, date, sum(precip) as _2wkprecip from to_be_summed

           group by date;

quit;

Until someone "fixes" the code, you would need a similar group of statements for each time period, and for X2, X3, etc.

There remains the final task of putting the pieces together.  But this gives you a starting point.

Good luck.

Solution
‎02-13-2012 02:42 PM
Respected Advisor
Posts: 4,918

finding the sum over intervals of time

Posted in reply to peatjohnston

Try this:

data have;
format d date11.;
input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;
datalines;
02AUG1994                    8
03AUG1994                   56
04AUG1994                   18
05AUG1994                   15
06AUG1994                   36
07AUG1994                    0
08AUG1994                    0
09AUG1994                    0
10AUG1994                    0
11AUG1994                   76
12AUG1994                    0
13AUG1994                   33
14AUG1994                   79
15AUG1994                    0
16AUG1994                   13
17AUG1994                    5
18AUG1994                    0
19AUG1994          X2       61
20AUG1994                    0
21AUG1994                   33
22AUG1994                  231
23AUG1994    X1              0
24AUG1994                    0
25AUG1994                    5
;

proc transpose data=have
out=refs(rename=(d=refDate col1=station) where=(notspace(station)>0) drop=_Smiley Happy;
by d notsorted;
var x:;
run;

data intervals;
input interval $;
datalines;
WEEK2
WEEK6
MONTH9
;

proc sql;
create table want as
select station, refDate, interval, sum(precip) as totalPrecip
from have, refs, intervals
where d between intnx(interval, refDate-1, -1, "SAME") and refDate
group by station, refDate, interval;

PG
Respected Advisor
Posts: 4,918

finding the sum over intervals of time

After which you may want to add :

proc transpose data=want out=wantTable(drop=_Smiley Happy;

var totalPrecip;

by station refDate;

id interval;

run;

to format the table.

Good luck.

PG
PROC Star
Posts: 7,467

finding the sum over intervals of time

PGStats: I'm responding, primarily, to say "Nice Solution!"

peatjohnston The only thing I'll add to PGStats' suggested code is to insure that you add a "quit;" statement at the end of the proc sql code.

Occasional Contributor
Posts: 17

finding the sum over intervals of time

This was most helpful and it looks like I need to start learning the SQL Procedure.  I can see that you put three 'tables' together and it appears that proc sql understands the temporal values (WEEK6, etc...). I will read up but there are a few things I would like to follow up on.

In the data step:

input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;

- I by pased this portion of the code since my date was already formated as a SAS date.  But my question is why did you specify the spacings for each variable, is that style or does proc sql want strict spacing at times?

In proc sql:

where d between intnx(interval, refDate-1, -1, "SAME") and refDate

- This ended up including an extra day in the summation and included the refDate itself in the summation (which was not something I discussed).  I modified the code to:

where d between intnx(interval, refDate, -1, "SAME") and refDate-1

This gave me the exact number of days up to but not including the sample date(WEEK2 = 14 days before the reDate not including the reDate).  With the sample date aside, was there a reason for using refDate-1 in the expression? as it appears that this increases the time interval by 1.

Also, adding the ':' at the end of a variable or unwanted character was not a trick I learned.  Is there a reference somewhere where I can find more of the basic programming syntax?  I am pretty much self taught and things like this make it a lot easier for me.  Thanks again, I have learned a lot from your answer.

Respected Advisor
Posts: 4,918

finding the sum over intervals of time

Posted in reply to peatjohnston

Learning SQL is costly but a good investment as it can be used with most DBMSs. SAS implementation of SQL is very powerful because it allows almost all of SAS functions and can interact with macro variables. I also find it very efficient. A book on relational databases is probably a better place to start learning SQL than SAS Doc.

The details of the first data step are not important and probably not very optimal as I almost never input text data anymore.

I had shifted the time window for the summation in order to match the 336 value from your example. I wasn't sure what you wanted exactly. In SAS, date values are in units of days so that refDate+1 is the day after refDate. But it is best to stay independent of the internal date representation. So, ideally, your where condition should read: where d between intnx(interval, refDate, -1, "SAME") and intnx("DAY", refDate, -1, "SAME").

Time intervals such as WEEK2 are understood by the intnx function. I have to go back to the documentation every time I use it.

I learned the colon trick the same way you did just a few days ago, on this forum. But it is documented, at least in version 9.3, along with many other variable list notations, it neans all the variables that start with.

PG

PG
Super User
Posts: 10,018

finding the sum over intervals of time

Posted in reply to peatjohnston

How about:

data have;
format d date11.;
input d date9. x1 $ 10-15 x2 $ 16-21 precip 22-30;
datalines;
02AUG1994                    8
03AUG1994                   56
04AUG1994                   18
05AUG1994                   15
06AUG1994                   36
07AUG1994                    0
08AUG1994                    0
09AUG1994                    0
10AUG1994                    0
11AUG1994                   76
12AUG1994                    0
13AUG1994                   33
14AUG1994                   79
15AUG1994                    0
16AUG1994                   13
17AUG1994                    5
18AUG1994                    0
19AUG1994          X2       61
20AUG1994                    0
21AUG1994                   33
22AUG1994                  231
23AUG1994    X1              0
24AUG1994                    0
25AUG1994                    5
;
run;
proc sql feedback;
create table want as
 select d,x1,x2,(select sum(precip) from have  where d between intnx('week2',a.d,-1,'s') and a.d ) as _2wkprecp,
                (select sum(precip) from have  where d between intnx('week6',a.d,-1,'s') and a.d ) as _6wkprecp,
                (select sum(precip) from have  where d between intnx('month9',a.d,-1,'s') and a.d ) as _9moprecp
  from have as a
   where x1 is not missing or x2 is not missing;
quit;


Ksharp

Occasional Contributor
Posts: 17

finding the sum over intervals of time

Hello Ksharp,

I did not run this, but I will look at it later and study it as an alternative as I start learning more about proc sql.  My question/comment is with the WHERE statement in the second to last line:

where x1 is not missing or x2 is not missing;

If there is an X1 and X2 ocurring on the same date, would it only select one variable, but not both?

Thanks for the insight.

Peat

Super User
Posts: 10,018

finding the sum over intervals of time

Posted in reply to peatjohnston

Hi. They will select both x1 and x2, because I have wroten x1,x2 in select clause.

Ksharp

Occasional Contributor
Posts: 17

finding the sum over intervals of time

Comparison of methods:

The method by Ksharp appears to be more direct as the run time is much faster (0.38 sec compated to the method from PGStats = 4.25 sec) and also transposes easily into the form I am looking for.  One thing that I did not intend to discover from this post is that Kshap's method also works if the X1, X2... VALUES are numbers instead of text place holders as I presented in this post.  As it turns out I have a separate set of data that I want to perform a similar operation on and then make calculations based on observed numbers to these metrics calculated by proc sql and in this way the data is already without transposing.  I just have to add another 'or X is not missing' to the statement for each of the variables (12 in all) which is just fine with me.

Thank you both, this has turned out to be highly educational for me.

Peat

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 2157 views
  • 3 likes
  • 5 in conversation