Anyone who has access to any sort of media, whether it’s social, TV, print, or radio, can't ignore that the U.S. presidential election is soon -- and unlike any other.
Being Canadian, my friends and I watch with intense fascination the mechanics behind this election – from the Electoral College to the different types vote-counting machines used across the nation. (In Ontario, we still use pencil and paper ballots). The U.S. has a vastly different process that takes much longer than Canadians experience.
I thought it would be interesting to use SAS University Edition to poke around freely available data from the U.S. presidential primary election and (egged on by certain friends) here's what I learned about the selections of Hillary Clinton and Donald Trump in the primaries.
Get the Data
I found a wide variety of datasets on the Election, from Internet memes to Twitter analysis, geospatial data to breakdowns by all sorts of demographics. I wanted to keep this simple, and found a great dataset on Kaggle – you can get the data here.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
I imported the data into SAS University Edition with the Import task, and had no issues. It has just over 24,000 rows and so will give us lots of data to play with.
The dataset is fairly straightforward – Party, Candidate, Votes, State, and County. So the first thing I’m going to do is create a simple bar chart looking at Party and the total number of votes by party and candidate.
When I run this, I find out a) I’d forgotten (or didn’t know) that some of these people ran and b) Republicans had significantly more candidates than the Democrats.
I don’t know if that’s a normal trend, if that’s the GOP way, or some other anomaly for this election, but as a data analyst, that’s interesting to me. It looks as though in the primaries, more people were hoping for a Republican outcome, but couldn’t agree on a single candidate.
Next, I want to see how the final two candidates fared; to do this I write a simple SQL query:
proc sql; select candidate, state, sum(votes) from work.import where candidate in ('Donald Trump', 'Hillary Clinton') group by candidate, state order by candidate, state
I’m selecting the variables I’m interested in, but limiting my results to just Trump and Clinton. As in previous articles on SQL, the Group By and Order By are needed when doing aggregate (summary) statistics like Sum.
In the above image, I’ve scrolled down part way to show you that although this is informative, it’s not easy to make comparisons. Ideally, we would have Clinton and Trump side-by-side, allowing for a more efficient review.
In order to do this, we have to make our SQL query slightly more complicated. I’m again selecting State, but this time I am doing a Sum based on specific criteria – that the Candidate is equal to ‘Donald Trump’ in the first case or equal to ‘Hillary Clinton’ in the second. I then do a sum of all votes.
proc sql; select state, sum(case when (candidate = 'Donald Trump') then votes end) as Trump, sum(case when (candidate = 'Hillary Clinton') then votes end) as Clinton, sum(votes) as Total from work.import group by state order by state; quit;
Now when I run my query, I get a much easier table to review and I can clearly see where there are significant differences.
I may want to look into Colorado, however, as it does appear to be odd that 0 votes in the entire state were for Trump. But this still isn’t quite ideal, as having just straight numbers makes comparisons possible, but not easy.
My final query is to take the counts, but also to bring in percentages – this will give a very clear picture of how everything fit together.
proc sql; select state, sum(case when (candidate = 'Donald Trump') then votes end) as Trump, sum(case when (candidate = 'Hillary Clinton') then votes end) as Clinton, sum(votes) as total, 100*sum(case when (candidate = 'Donald Trump') then votes end) / sum(votes) as Trump_Perc, 100*sum(case when (candidate = 'Hillary Clinton') then votes end) / sum(votes) as Clinton_Perc from work.import group by state order by state; quit;
You’ll notice that I’m keeping in the raw counts, as I like to see those, but I’m also calculating the percentages and calling them Clinton_Perc and Trump_Perc. These are the only two additions from the code above.
Now when I run this, I get this table:
Very easy to read, making comparisons by percentages simply a matter of going row by row.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.