I was asked this week to create a timeline analysis for my group so we can see if there are any patterns or trends in the timeseries data. I had a heck of a time figuring out the code to “bin” my data in 10-minute increments, and so wanted to post what I did here. I also saw a question on the communities site asking about how to compare times when you want to compare different rows / columns, so I’m going to do that first.
Get the Data
I needed data that had repeated measures, had a datetime stamp, but was not so overly complex to make my examples difficult to follow. I found a dataset on the City of Toronto site that was perfect – calls to the City’s Information / Services line, 311. You can get the data here.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
The data was already in a format that I could use, and there was no missing or clearly incorrect data.
My first task is to find out what the average length of time is between calls for each day.
The first thing I want to do is strip out the date from my “creation_date” variable:
proc sql; create table work.test as select *, datepart(creation_date) format=date8. as date from work.import; quit;
I’m doing this process in more steps than I normally do to highlight specific aspects. The next step is I need to assign a rank or a sequence number to each observation; I could use PROC RANK, but as I’m learning the DATA step I figured this was a good reason to use it here:
data work.test; set work.test; by date; retain row_number; if first.ID then do; row_number = 0; end; row_number + 1; run;
Here’s what my data looks like after these first two steps (yes, my row_number starts at 3, and I think this is because I’ve run my DATA Step twice already; it won’t have any effect on the steps below):
Next, we want to establish the length of time between Observation 1 and 2, 2 and 3, etc. – but only for each day. Although doing the length between last call / first call of the next day makes sense, I want to keep each day separate for this example. I’m putting this first step into a new table work.test2.
proc sql; create table work.test2 as select month(a.date) as Month, day(a.date) as Day, a.creation_date as FirstDate, b.creation_date as SecondDate, (b.creation_date- a.creation_date)/60 as TimeDiff
/* Because I’m keeping the comparisons within the same day,
I can just subtract the two days and then divide by 60
minutes to come up with the length between calls */ from work.test a , work.test b
/* Because I’m comparing Row 1 to Row 2, I’ve found the easiest way is to
join the table to itself. */ where day(a.creation_date) = day(a.creation_date) and
/* Because we’re looking at the data on a day-by-day level, I join the
table to itself on Day and Month */ month(a.creation_date) = month(a.creation_date) and b.row_number=a.row_number+1 ;
/* This is ensuring I’m pulling out A B , B C , C D.
Without this line, I’d get A B, B A, B C, C B */ quit;
The next step is to take this new work.test2 and run the query to calculate the average:
proc sql; select month, day, avg(TimeDiff) as avg from work.test2 a group by month, day order by month, day; quit;
Here are the results for the first few days in January 2015:
Now let’s turn to the task that started me off – getting data into 10-minute increments and then generate some sort of timeline.
proc sql outobs=10000;
/* To make the query run a little faster, I only want to output 10,000 rows
create table work.testb as
datepart(creation_date) format=date. As Date, /* Pulls out just the date */
timepart(creation_date) format=time. As Time, /* Pulls out just the time */
/* Takes the creation_date, splits it up into HOURS and MINUTES,
and then breaks the Minutes out into the 10 minute increments.
I also need to convert my hours and minutes to text, as I’m
them with a colon and so they have to be the same data type.
The CASE statements I have are to:
1) If the hour is less than 10, put a 0 in front
(01:34 instead of 1:34, to ensure sorting works) round
the minutes because (32/10)*10 is still 32;
rounding gives me the correct 30 I’m looking for.
2) when the minutes are less than 10, my result will look
like 12:0 which doesn’t make sense.
3) when the minutes are greater 55 (for example, 13:55), my
result will look like 13:60, which also doesn’t make sense. */
(case when (round((minute(creation_date)/10))*10)=60
else put(hour(creation_date),best2.) end),
round((minute(creation_date)/10))*10<10 then '00'
when round((minute(creation_date)/10))*10=60 then '00'
else put(round((minute(creation_date)/10))*10,best2.) end)) as TimeBin
Here’s what I get:
I then put my data into a Count table because I want counts based on my time bins:
proc sql; create table work.testc as select TimeBin, count(*) as Count from work.testb group by TimeBin order by TimeBin; quit;
For the purposes of my time line, I pull all data (not just the 10000 from above) and I get a rather interesting graph. Further analysis would be needed to tease out the dips that are very apparent.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.