09-14-2012 03:13 PM
09-14-2012 10:31 PM
Not really sure when your 20-day periods should start; it seems a bit arbitrary that they should simply be counted from the first date in your data.
Your time series also has missing dates, so what should be done if a 20-day boundary falls on a date that is not in the data?
The code below gives you a rolling average over the last 20 days (the current day counted as day 1). This should allow you to pick whatever days you need.
proc sql;
create table want as
select distinct l.date, l.return, avg(r.return) as avg_return_last20days
from ftse l left join (select *, min(date) as min_date from ftse) r
on l.date >= r.date and l.date - 20 < r.date and l.date >= r.min_date + 19
group by l.date;
quit;
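For readers who want to sanity-check the windowing rule outside SAS, here is a rough Python sketch of the same logic: the last 20 calendar days, current day counted as day 1, with no average produced until the first date in the data is at least 20 days in the past. Function and variable names are mine, not from the thread, and this is an O(n²) illustration, not production code.

```python
from datetime import date, timedelta

def rolling_20day_avg(rows):
    """rows: list of (date, return) pairs sorted by date; gaps allowed.

    Returns {date: average of returns over the window [d-19, d]},
    emitted only once the first date is at least 19 days back,
    mirroring the l.date >= min_date + 19 condition in the SQL."""
    first = rows[0][0]
    out = {}
    for d, _ in rows:
        if (d - first).days < 19:
            continue  # fewer than 20 calendar days of history so far
        window = [r for (dd, r) in rows if d - timedelta(days=19) <= dd <= d]
        out[d] = sum(window) / len(window)
    return out
```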
09-14-2012 10:50 PM
This is not a general-purpose solution.
It only works for a specific small set of data, and it will not work for a dataset that spans more than 40 days.
But you do raise a lot of good questions.
Those need to be answered before an appropriate solution can be arrived at.
One thing to look into would be the intnx function and its related functions.
I would also recommend using a single data step.
Trying to make proc sql work here would be a mental exercise, and it is contrary to a basic principle of SAS coding: use the simplest method and minimize coding effort.
09-15-2012 11:37 PM
Whether you regard a data step or SQL as easier to implement will mainly depend on your familiarity with the two syntaxes. If the source is a database table, then SQL is often preferable.
To my knowledge there is no out-of-the-box 20-day interval among the SAS calendar functions, so you would have to create a custom interval. Not that this is very difficult, but I don't really see the value in it when it's so easy to just subtract 20 days from a SAS date value.
09-16-2012 08:36 AM
Patrick: Okay. On a careful re-read of your response, the phrase "custom interval" was definitely there. However, like formats, custom-interval datasets can be made permanent, and thus can be quite useful for repeating tasks.
09-17-2012 11:51 PM
It is not true that proc sql versus the data step is a matter of personal preference or of what you learned first.
There are things you can do with a data step that you cannot do with proc sql, because a data step is procedural code and proc sql is not.
And there are things you can do with proc sql that you cannot do with a data step, because SQL uses predicate algebra and operates on subsets of a Cartesian product.
Many years ago, back in 1998, I was overjoyed to learn that SAS had proc sql to do a proper join with. Trying to code a join in a data step is a royal pain.
A merge is not a join, and a join is not a merge, although in SAS you can now sort of simulate a merge with a special form of join.
The way to create a proper, general solution to this question, one that does not require a bunch of convoluted coding, is a data step.
It would be much simpler and more elegant, and it works in the general case.
Your solution does not work in the general case because the "max" and "min" as you are using them do not change as you step through the data; they are static quantities.
A more general solution, using an assumption that missing dates are missing and not to be interpolated:
data results;
  set in_data;                         /* assumes variables datestamp and returns */
  retain prior total cnt;
  if _n_ = 1 then do;
    prior = datestamp; total = 0; cnt = 0;
  end;
  if (datestamp - prior) > 20 then do; /* more than 20 days elapsed: start a new window */
    prior = datestamp; total = 0; cnt = 0;
  end;
  total = sum(total, returns);
  cnt + 1;
  if cnt > 0 then average_return = total / cnt;
  else average_return = . ;
run;
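As a cross-check of that single-pass idea, a rough Python equivalent might look like the following. The names are hypothetical, and integer day numbers stand in for SAS date values.

```python
def single_pass_window_avg(rows):
    """Single pass over (day_number, ret) pairs sorted by day.

    Accumulates a running average and starts a new accumulation
    whenever more than 20 days have passed since the current window
    began, paralleling the retain/prior logic of the data step."""
    prior, total, cnt = None, 0.0, 0
    out = []
    for day, ret in rows:
        if prior is None or day - prior > 20:
            prior, total, cnt = day, 0.0, 0  # open a new window
        total += ret
        cnt += 1
        out.append((day, total / cnt))
    return out
```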
In SQL, grouping into fixed 20-day bins instead:
proc sql;
create table results as
select *, floor(datestamp/20)*20 as base_date, mean(returns) as average_returns
from in_data
group by floor(datestamp/20)
order by base_date;
quit;
So, in this case, SQL actually is easier, but it produces different results. Are those results close enough to the intent?
To make the two match, you would need a subquery that introduces an offset to the base:
select *, floor((datestamp - least_date)/20)*20 as base_date, mean(returns) as average_returns
from (select *, min(datestamp) as least_date from in_data)
group by floor((datestamp - least_date)/20)
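The same fixed-bin grouping, offset by the earliest date, can be sketched in Python. Names are mine, and this assumes in-memory (date, return) pairs:

```python
from datetime import date

def binned_20day_avg(rows):
    """rows: list of (date, return) pairs.

    Buckets each observation into a fixed 20-day bin counted from the
    earliest date in the data, then averages within each bin, mirroring
    the floor((datestamp - least_date)/20) grouping."""
    least = min(d for d, _ in rows)
    bins = {}
    for d, r in rows:
        base = (d - least).days // 20 * 20  # bin start, in days after least
        bins.setdefault(base, []).append(r)
    return {b: sum(v) / len(v) for b, v in bins.items()}
```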
Then there's the question: 20-day intervals, or 20 observations?
And do you really want fixed 20-day intervals, or a rolling 20-day average?
The Data Step passes through the data once.
The SQL passes through the data 3 times.
So, which is best?
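If the answer to the first question is "20 observations", a sketch of that variant, again in Python with hypothetical names, would be:

```python
from collections import deque

def rolling_20obs_avg(returns):
    """Rolling average over the last 20 observations (not calendar days).

    Yields None until 20 observations have accumulated, then the mean
    of the most recent 20 values for each subsequent observation."""
    window = deque(maxlen=20)
    for r in returns:
        window.append(r)
        yield sum(window) / 20 if len(window) == 20 else None
```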
09-18-2012 06:43 AM
One of the things I like about SAS is that there are always many ways to solve a problem. I agree, of course, that the SAS data step and SQL are not the same.
As for the SQL I've posted:
I normally test my code if sample data is provided. The OP attached several years' worth of data, and it looks to me as if I'm getting correct results after the first 40 days as well.
The min() function derives the earliest day in the whole sample; its only purpose is to avoid calculating an average before 20 days of data are available (as it's supposed to be a rolling 20-day window).
You could easily replace the hard-coded days with macro variables, so the code I've posted looks "general" enough to me.
As for performance/passes through data:
It could certainly be done less wastefully with a data step. But considering the volumes, does it really matter? I just ran a test over 50 years of data, and it took around 14 seconds on my laptop.