Further to my post last week about the US Primaries, I wanted to find more American data to explore. After some poking around, I found a couple of great datasets at https://analytics.usa.gov. One in particular was about the traffic on US Government websites, and I was intrigued. Would there be anything relevant to the upcoming elections?
Get the Data
I recommend going through the Analytics website – definitely enough there for me to write a year’s worth of Free Data Friday posts! However, the data I used for this article came from https://analytics.usa.gov/data/live/all-domains-30-days.csv. The data imported into SAS University Edition without issue. Note: the data is for the past 30 days based on the date you’re pulling the data, so numbers will change. I ran my data on October 30, 2016.
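If you would rather pull the file in code than upload it, here is a minimal sketch. It assumes your SAS session can reach the URL over HTTPS (that is not guaranteed in every University Edition setup) and that the file is still published at that address. The fileref name govcsv is my own choice; the output table WORK.IMPORT matches the name used in the rest of this article.

```sas
/* Assumption: the session has outbound HTTPS access to this URL. */
/* govcsv is a hypothetical fileref name chosen for illustration. */
filename govcsv url "https://analytics.usa.gov/data/live/all-domains-30-days.csv";

/* Read the CSV into WORK.IMPORT, scanning all rows to guess column types */
proc import datafile=govcsv
    out=work.import
    dbms=csv
    replace;
    getnames=yes;
    guessingrows=max;
run;

/* Release the fileref once the import is done */
filename govcsv clear;
```

If the URL access method is blocked, downloading the CSV and importing the uploaded file works just as well.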
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
Nothing was required to get the data ready – it was already in a format that I could use, and there were no missing or clearly incorrect data.
So the first thing I wanted to do is get a sense of the data, for which I did a simple scatterplot using the Task that comes with SAS University Edition:
However, when I run this task, I get an error message I’ve not seen before:
Gah! What the heck am I supposed to do now? Unfortunately, we can’t use the task as it is. However, I can copy the code and make a couple of minor tweaks:
/*--Set output size--*/
ods graphics / discretemax=2000 imagemap=off;
/* The discretemax allows me to turn off the default of 1000 distinct
   datapoints and customize it. Turning the imagemap off removes the
   mouseovers for each datapoint */

/*--SGPLOT proc statement--*/
proc sgplot data=WORK.IMPORT;
    /*--Scatter plot settings--*/
    scatter x=domain y=users / transparency=0.0 name='Scatter';
    /*--X Axis--*/
    xaxis grid;
    /*--Y Axis--*/
    yaxis grid;
run;

ods graphics / reset;
This gives us the graph as below – pretty useless as we can’t see the individual sites, but it does allow us to see overall volumes and to get a sense of what’s considered a “high traffic” site.
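Another way to get a sense of what counts as "high traffic" is to look at the distribution of the users column directly. This is only a sketch, and the choice of percentiles is mine rather than anything the data dictates:

```sas
/* Summarize the users column to see where the high-traffic sites sit */
proc means data=work.import n min median p90 p99 max maxdec=0;
    var users;
run;
```

The P90 and P99 values give a rough idea of a sensible cut-off when deciding which sites to keep.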
To make the scatter plot useful, I’m going to limit the dataset to those sites that had more than 5,000,000 users (again, this is from the past 30 days, so a threshold of 5 million users should give me a significantly smaller dataset).
Here’s my PROC SQL to generate the dataset:
proc sql;
    create table work.import2 as
    select *
    from work.import
    where users>5000000;
quit;
When I run my scatter plot on work.import2 using the same X- and Y-variables, I get the following. Much more reasonable as I can now read the individual sites:
I don’t know what tools.usps.gov is. When I try to go to the site it says Server Not Found, so I assume you have to log in to get there. In any case, it has a huge number of visitors.
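The filter-then-plot steps above can also be collapsed into one, since a WHERE statement works inside PROC SGPLOT. This is an alternative sketch rather than the code I actually ran; it uses the same domain and users columns and the same 5,000,000 threshold:

```sas
/* Filter and plot in a single step, without creating work.import2 */
proc sgplot data=work.import;
    where users > 5000000;          /* keep only the high-traffic sites */
    scatter x=domain y=users;
    xaxis grid;
    yaxis grid;
run;
```

With so few sites left after the filter, the DISCRETEMAX option is no longer needed.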
One and done or repeat visitors?
The next comparison I wanted to do was Users and Visits, to see if most people are going in only once during the 30 days or if there are sites people tend to go to repeatedly. Here’s how I set up my task:
And here are the results:
Because of limited space on the Y-axis, SAS has made a minor change to the formatting – the values are now in exponential format, where 4E7 means 4 × 10^7 (or 40,000,000). Again, tools.usps.gov is clearly at the top of the pile, but it also appears that most users just go in once. The forecast.weather.gov site, however, appears to have mostly visitors who come back more than once, which makes sense. Knowing how often I check the weather, this doesn’t surprise me.
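To put a number on the "one and done" question, a rough sketch is to divide visits by users for each site. I am assuming here that the visits column is named visits in the imported table and that the same 5,000,000-user filter applies; the table name work.repeat_visits and the column visits_per_user are my own inventions for illustration.

```sas
/* Visits per user: close to 1 suggests one-time visitors, */
/* higher values suggest people coming back repeatedly.    */
/* Assumption: the imported table has a numeric column named visits. */
proc sql;
    create table work.repeat_visits as
    select domain,
           users,
           visits,
           visits / users as visits_per_user format=8.2
    from work.import
    where users > 5000000
    order by visits_per_user desc;
quit;
```

Sorting in descending order puts the sites with the most repeat traffic at the top of the table.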
For the final analysis, let's look at average duration of a user’s session. Are they logging on and quickly leaving, potentially an indication of getting what they need quickly or realizing it’s the wrong site?
I wanted to limit my data to those sites where the average session lasted longer than 2,000 seconds; this would indicate that visitors have found what they’re looking for and are spending an average of more than 30 minutes reading it. Or they haven’t found what they’re looking for and are determined to find it.
Here’s my code for the creation of the subdata:
proc sql;
    create table work.import2 as
    select *
    from work.import
    where avg_session_duration>2000;
quit;
Here’s the scatterplot showing the results:
The usastaffing.opm.gov site has a large number of users but most spend about 30 minutes. On the other end of the spectrum, water.noaa.gov has a smaller number of users, but they spend significantly longer on the site, almost a full hour and a half. I would imagine that this site is limited to people working in meteorology, oceanography, etc. and possibly looking at satellite images or other types of data/documentation that require significant time to review.
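Since I keep translating seconds into minutes in my head (2,000 seconds is about 33 minutes), a small sketch like the one below does the conversion in the data itself. It assumes avg_session_duration is a numeric column holding seconds; the table name work.long_sessions and the column avg_session_minutes are hypothetical names of my own.

```sas
/* Keep the long-session sites and add a minutes column for readability */
/* Assumption: avg_session_duration is numeric and measured in seconds  */
data work.long_sessions;
    set work.import;
    where avg_session_duration > 2000;
    avg_session_minutes = avg_session_duration / 60;
    format avg_session_minutes 8.1;
run;

/* Longest average sessions first */
proc sort data=work.long_sessions;
    by descending avg_session_minutes;
run;
```

Plotting avg_session_minutes instead of the raw seconds makes the Y-axis easier to read at a glance.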
So although I can't say for certain that any of these sites have anything to do with the election, I'm curious to see an American perspective on this data!
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.