Easter has gone by and if you celebrate it, I hope you had fun celebrating with your family. In my last blog I discussed that Easter is not always a treat for data scientist because it's a moving holiday. In this article, I would like to introduce you to:
All these stories relate to date interval length calculations that involve movable holidays. You will see how easy it is to perform complex calculations and data preparation with movable dates like Easter or point-in-time definitions like "the first full weekend in February."
I met my friend George after Easter. He got married two years ago and has a very good marriage. After a few drinks however he admits that spending time with the family of his wife, especially his mother-in-law, is not the most ideal. Fortunately this only happens twice a year, at Easter and Christmas.
George says that his most enjoyable years are those with a long time span between Easter and Christmas and with a short visit from his parents-in-law at Christmas. However as both time intervals depend on movable dates, it's hard to calculate this for future years.
SAS won't be able to change the attitude of his mother-in-law. But we can easily calculate the length of the time interval when George doesn't have to see her.
The following SAS program calculates the "relaxing time interval" for George between Easter and Christmas. It also calculates the number of days his parents-in-law stay with them for Christmas.
data mother_in_law_free_interval;
format Year 8.
EasterSunday date9.
SatBefore2412 weekdatx.;
do Year = 2019 to 2030;
SatBefore2412 = intnx('week.7',mdy(12,23,year),0);
EasterSunday = holiday('EASTER',year);
RelaxingDays = SatBefore2412-EasterSunday;
RelaxingWeeks = intck('WEEK',EasterSunday,SatBefore2412)+1;
DaysXmas = mdy(12,26,year) - SatBefore2412 + 1;
output;
end;
run;
The following table shows that in 2019 there are only 35 weeks between Easter and Christmas. The Saturday before Christmas is Dec 21st which means that Georges family-in-law will stay for 6 days from Dec 21st to Dec 26th.
2022 looks like a hard year for George where the number of relaxing weeks is low and he has them in his house for the maximum of 10 days around Christmas. 2023 looks much better, George!
Joanna is a lecturer at a business school in northern Germany. She only teaches in the winter term which ends on the last Saturday in January. The first full weekend in February is therefore the earliest date when she can be home in Austria and start the skiing season. Many ski lifts in Austria operate until Monday after Easter. So depending on the effective Easter date, Joanna has a varying number of weekends available for skiing every year.
The following simple SAS program calculates the number of "skiing weekends for Joanna" (= number of weekends starting with the first full weekend in February until the Easter weekend).
data skiing_weekends;
format Year 8.
FirstFullSatInFeb weekdatx.
EasterMonday date9.;
do Year = 2015 to 2025;
FirstFullSatInFeb = intnx('week.7',mdy(1,31,year),1);
EasterMonday = holiday('EASTER',year)+1;
NumSkiWeekends = intck('week.1',FirstFullSatInFeb,EasterMonday);
output;
end;
run;
The following table shows the results for the years 2015-2025. You see that the number varies due to the moving Easter date and the fact that the first full weekend in February is moving as well.
Good old friends of my parents are traveling throughout the year across the world. They are real globetrotters: New Zealand, Peru, Canada, Sweden, India, Hawaii, Russia, China are just a selection of countries they visited over the last years. For them the 50 days between Easter and Pentecost are the only time in the year where they can free some time to visit my parents at Lake Neusiedl.
Of course they would like to have good weather when they come to Burgenland in Austria.
The Hydrological office of Burgenland maintains a great web portal with measures in 15-minute intervals of the water level, temperature and precipitation. I also use this data for my data science lectures at the University, where we build a small data warehouse and run some data science analyses. For my lecture the office provided me data on that granularity starting from 1999.
Selected rows of this data are shown below for the precipitation (N) and the temperature (T).
The following query joins the two tables together, and aggregates the measures on a daily level.
proc sql;
*** Daily Aggregation;
create table Weather_Illmitz
as
select datepart(t.datumzeit) as Date format=date9.,
mean(t.wert) as TempMean format=8.1,
max(t.wert) as TempMax format=8.1,
sum(n.wert) as RainSum format=8.1
from h.n_i_oldhist as n,
h.t_i_oldhist as t
where n.datumzeit = t.datumzeit
group by Date;
quit;
As the timestamp in this data is in the format DATETIME, you need to use the DATEPART function to extract the date.
The resulting table WEATHER_ILLMITZ with data on a daily basis looks as follows:
Based on this table you now want to calculate the desired statistics on a yearly basis for the interval between Easter and Pentecost.
proc sql;
*** Yearly Statistics;
create table Weather_Yearly
as select year(Date) as Year,
holiday('easter',year(Date)) as EasterSunday format=date9.,
mdy(month(calculated EasterSunday),day(calculated EasterSunday),2018) as Easter format=date9.,
sum(TempMax > 25) as DaysTemp25,
sum(RainSum > 1) as RainDays
from Weather_Illmitz
where holiday('easter',year(Date)) <= date <= holiday('easter',year(Date))+49
group by Year, EasterSunday;
quit;
The following table reports the number of days with a temperature > 25 °C and the number of RainDays per year between Easter and Pentecost. You see how easy it is to use the HOLIDAY function to select the appropriate records per year.
Based on this data, the following query calculates the overall probability for a
"Temperature > 25 °C" day with 17.8 % and for a "Rain" day with 20.8 %.
Based on that, the friends of my parents can get a clearer picture of the weather conditions
at that time of the year here in Burgenland.
The variable EASTER contains the Easter date for every year shifted to the year 2018. You can use this variable to plot the relationship of the number of "Temp > 25 °C" days with the occurrence of Easter in the year.
proc sgplot data=Weather_Yearly;
reg x=Easter y=DaysTemp25;
run;
You can use the REG statement in the SGPLOT procedure to not only receive a scatter plot but also a regression line.
You see the expected relationship that a year with later Easter date consequently has more "Temp > 25°C" days until Pentecost. Unfortunately, the same is true for the number of Rain Days. The probability for rain also increases with a later Easter date.
You can apply the REG procedure to a simple linear model that quantifies this relationship.
data Weather_Yearly;
set Weather_Yearly;
DaysPast21MAR = Easter - '21MAR2018'd;
run;
proc reg data=Weather_Yearly;
model DaysTemp25 = DaysPast21MAR;
run;
Note that it is advisable to center the dependent variable for the analysis for a numerically more stable estimate. The date variable EASTER contains values around 20,000. If you simply subtract the 21st of March you receive smaller, better computable and better interpretable values as "Days past the earliest possible Easter date."
For my parent's friends, these results mean that he prefers a year with an earlier Easter date as this reduces the risk for a rainy day, while his wife prefers a year with a later Easter date as she can expect more days with a temperature of more the 25 °C. Lets see how they decide 🙂
In one of his DO-LOOP blog contributions, Rick Wicklin explains how you can easily add the regression equation to a regression line plot.
These examples can be easily generalized to business applications. When modeling the demand in the retail, hospitality or leisure industry, it's important to consider whether Easter falls into March or April or whether a particular month has 4 or 5 Saturdays or Sundays. These facts can heavily influence demand pattern, traffic, number of visits.
For predictive modeling and time series forecasting, it's therefore important to be able to calculate the length of simple and complex time intervals and to count the number of special days and events.
Date and Time functions in SAS like the INTNX and the INTCK function as well as the family of the HOLIDAY functions provide great support for these tasks.
More content on data preparation for data science can be found in my SAS Press books. Especially in "Data Preparation for Analytics Using SAS".
Great article using some rarely used SAS date functions. I would add NWKDOM() as a useful function for relative calendar dates. For example, President's Day holiday in the U.S. is the third Monday in February. PresidentDay = NWKDOM(3, 2, 2, &year.). Where: 3 = the third week of the month; 2 = the second day of the week (Monday); 2 = the second month in the year; &year. is a macro variable of the referenced year.
Very good comment! thanks for bringing this up, Jim. Yes, the NWKDOM function is also here to help you with dates. Applying this function is also useful in my examples 1 and 2.
Thanks for sharing this. I will try to use it shortly.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.