Hi all,
I am looking for your help on how to compute 5 year's monthly moving average to compare with the current month. Mean number of cases is based on the past two months of the current year and the current month ± 2 months for 5 previous years (27 months). See illustration in the picture below, which how the yellow colored cells (27 months) are used in moving average to compare with current month, green cell. Data sets is below.
data have;
input Disease $ YEAR Month count;
datalines;
A 2018 1 19
A 2019 1 25
A 2020 1 15
A 2021 1 6
A 2022 1 18
A 2023 1 6
A 2018 2 22
A 2019 2 9
A 2020 2 21
A 2021 2 3
A 2022 2 8
A 2023 2 19
A 2018 3 20
A 2019 3 15
A 2020 3 5
A 2021 3 9
A 2022 3 9
A 2023 3 18
A 2018 4 11
A 2019 4 10
A 2021 4 1
A 2022 4 8
A 2023 4 15
A 2018 5 14
A 2019 5 16
A 2020 5 9
A 2021 5 4
A 2022 5 10
A 2018 6 19
A 2019 6 15
A 2020 6 8
A 2021 6 6
A 2022 6 26
A 2018 7 21
A 2019 7 12
A 2020 7 42
A 2021 7 16
A 2022 7 12
A 2018 8 15
A 2019 8 15
A 2020 8 13
A 2021 8 13
A 2022 8 8
A 2018 9 9
A 2019 9 8
A 2020 9 4
A 2021 9 11
A 2022 9 4
A 2018 10 12
A 2019 10 10
A 2020 10 5
A 2021 10 11
A 2022 10 12
A 2018 11 7
A 2019 11 15
A 2020 11 5
A 2021 11 6
A 2022 11 7
A 2018 12 22
A 2019 12 13
A 2020 12 7
A 2021 12 20
A 2022 12 14
B 2018 1 7
B 2019 1 7
B 2020 1 9
B 2021 1 8
B 2022 1 8
B 2023 1 13
B 2018 2 9
B 2019 2 11
B 2020 2 16
B 2021 2 5
B 2022 2 4
B 2023 2 11
B 2018 3 13
B 2019 3 8
B 2020 3 6
B 2021 3 9
B 2022 3 11
B 2023 3 19
B 2018 4 13
B 2019 4 17
B 2020 4 5
B 2021 4 11
B 2022 4 14
B 2023 4 10
B 2018 5 17
B 2019 5 12
B 2020 5 13
B 2021 5 13
B 2022 5 20
B 2023 5 1
B 2018 6 26
B 2019 6 32
B 2020 6 23
B 2021 6 21
B 2022 6 24
B 2018 7 24
B 2019 7 30
B 2020 7 31
B 2021 7 34
B 2022 7 22
B 2018 8 16
B 2019 8 18
B 2020 8 24
B 2021 8 15
B 2022 8 13
B 2018 9 8
B 2019 9 23
B 2020 9 19
B 2021 9 8
B 2022 9 16
B 2018 10 8
B 2019 10 13
B 2020 10 14
B 2021 10 8
B 2022 10 11
B 2018 11 16
B 2019 11 13
B 2020 11 6
B 2021 11 8
B 2022 11 12
B 2018 12 12
B 2019 12 10
B 2020 12 3
B 2021 12 10
B 2022 12 9
;
run;
Hi @bijayadhikar,
Try this:
proc sql;
create view _tmp as
select *, mdy(month,1,year) as d, min(calculated d) as min_d
from have
group by disease;
create table want as
select b.disease, b.year, b.month, b.count, mean(a.count) as movavg, n(a.count) as n
from _tmp b left join _tmp a
on a.disease=b.disease & mod(intck('month',a.d,b.d)+2,12)<5
& 1<=intck('month',a.d,b.d)<=62
& intck('month',b.min_d,b.d)>=62
group by 1,2,3,4;
drop view _tmp;
quit;
In addition to the moving average (variable movavg), the number n of values contributing to that average is written to the output dataset WANT. So you could exclude results based on too few values (because of missing count values). Moving averages are set to missing if the first time point for their computation is prior to the earliest date in the data (within the Disease group).
I am looking for your help on how to compute 5 year's monthly moving average to compare with the current month. Mean number of cases is based on the past two months of the current year and the current month ± 2 months for 5 previous years (27 months). See illustration in the picture below, which how the yellow colored cells (27 months) are used in moving average to compare with current month, green cell. Data sets is below.
Please walk us through an example or two of calculating this monthly moving average using 5 years data by group. Please explain what you mean by "compare".
Hi @bijayadhikar,
Try this:
proc sql;
create view _tmp as
select *, mdy(month,1,year) as d, min(calculated d) as min_d
from have
group by disease;
create table want as
select b.disease, b.year, b.month, b.count, mean(a.count) as movavg, n(a.count) as n
from _tmp b left join _tmp a
on a.disease=b.disease & mod(intck('month',a.d,b.d)+2,12)<5
& 1<=intck('month',a.d,b.d)<=62
& intck('month',b.min_d,b.d)>=62
group by 1,2,3,4;
drop view _tmp;
quit;
In addition to the moving average (variable movavg), the number n of values contributing to that average is written to the output dataset WANT. So you could exclude results based on too few values (because of missing count values). Moving averages are set to missing if the first time point for their computation is prior to the earliest date in the data (within the Disease group).
Thank you @FreelanceReinh . I am going to try your proposed script.
@PaigeMiller, thanks you. about your question, please find little bit detail about what I was trying to accomplish. Although I did not include standard deviation calc in my original ask, I will be using SD as mentioned in the excel formula
I'm afraid I'm still not clear on what you are doing. In fact I am mystified. It seems you are averaging
Feb 2018, Feb 2019, Feb 2020, Feb 2021, Feb 2022, Feb 2023, Mar 2018, Mar 2019, Mar , Mar 2021, Mar 2022, Mar 2023,
Apr 2018, Apr 2019, Apr 2020, Apr 2021, Apr 2022, May 2018, May 2019, May 2020, May 2021, May 2022,
Jun 2018, Jun 2019, Jun 2020, Jun 2021, Jun 2022
Not only do I not grasp the logic here why you are averaging those months in that way, but I don't see how that logic translates to row 2 and row 3 and row 12 and row 13 and row 14 and row 15 and row 24 and row 25. I would like even more detail than you have provided so far — I would like a complete and clear explanation of this process.
Thanks @PaigeMiller for further follow up. The logics I am applying here is that for disease aberration detection, we apply historical weekly (or monthly) mean as a threshold (i.e. expected value) to compare with the current week (or month) count to say whether current week (or month) disease count is above or below the threshold. More historical data we use, the better expected value we can get.
The 27 months mean: The use of 25 months as a means of center moving average around current month (i.e. April), plus the previous 2 month of current month. This month's disease occurrence is influenced by what happened past couple of months. Fig 2 in this article described a concept. https://downloads.regulations.gov/FDA-2011-N-0731-0001/attachment_3.pdf
The figure 2 is what I am trying to create for each week. In my example, for each month. Does this answer your question?
@bijayadhikar wrote:
Thanks @PaigeMiller for further follow up. The logics I am applying here is that for disease aberration detection, we apply historical weekly (or monthly) mean as a threshold (i.e. expected value) to compare with the current week (or month) count to say whether current week (or month) disease count is above or below the threshold. More historical data we use, the better expected value we can get.
The 27 months mean: The use of 25 months as a means of center moving average around current month (i.e. April), plus the previous 2 month of current month. This month's disease occurrence is influenced by what happened past couple of months. Fig 2 in this article described a concept. https://downloads.regulations.gov/FDA-2011-N-0731-0001/attachment_3.pdf
The figure 2 is what I am trying to create for each week. In my example, for each month. Does this answer your question?
No, it does not. I asked more than one question, so not it does not answer all of my questions, which was the desired result. I want to know how to apply this to row 13 of your table. And how to apply it to row 14 of your table. And so on.
You have now explained why you want those particular 27 months. Although I don't really see a connection between that and Figure 2 of that article, but that's mostly irrelevant now.
@PaigeMiller First the use of 27 months is required in our public health practice and epidemiological view point. The row 2 can be ignored due to prior data (<2018) not available to calculate mean, though we could calculate mean for rows 3 and 4, but I wanted to start at row 5, where I highlighted in green. Data for row 13 (i.e. when we reach Feb 2024) will be available by adding the year 2023 (5 year block from 2019-2023), and Feb 2024 becomes the current month to compare. This is called rolling average. The program I am writing will be automated runs once a month off the database, so it will sends out alert when current month counts is higher than expected.
@FreelanceReinh I did run your script. It did produce a desired result for now. I am not sure if I will have any issue going into next year, i.e. 2024. The 5-years block will be from 2019-2023, and current year will be 2024. Regarding missing values, I will insert missing month and missing values as 'zero'. For example for disease A, no data in July, I will insert 0 for that month. Technically zero is a value (i.e. zero cases reported). I am leaning towards accepting as solution in couple of days after few checks.
@bijayadhikar wrote:
@FreelanceReinh I did run your script. It did produce a desired result for now. I am not sure if I will have any issue going into next year, i.e. 2024. The 5-years block will be from 2019-2023, and current year will be 2024.
There should be no issues with future years. The first two criteria after "a.disease=b.disease" in the ON clause could be summarized equivalently to
intck('month',a.d,b.d) in (1 2 10:14 22:26 34:38 46:50 58:62)
(where 10:14 is the SAS abbreviation for 10 11 12 13 14). The last criterion ensures that the moving average is not calculated if the available count data for the disease in question (whether missing values or not) don't go back to at least 62 months before the "pivot" month. So the code uses only relative date criteria.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.