Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Anyone who has access to any sort of media, whether it’s social, TV, print, or radio, can't ignore that the U.S. presidential election is soon -- and unlike any other.
Being Canadian, my friends and I watch with intense fascination the mechanics behind this election – from the Electoral College to the different types vote-counting machines used across the nation. (In Ontario, we still use pencil and paper ballots). The U.S. has a vastly different process that takes much longer than Canadians experience.
I thought it would be interesting to use SAS to poke around freely available data from the U.S. presidential primary election and (egged on by certain friends) here's what I learned about the selections of Hillary Clinton and Donald Trump in the primaries.
Get the Data
I found a wide variety of datasets on the Election, from Internet memes to Twitter analysis, geospatial data to breakdowns by all sorts of demographics. I wanted to keep this simple, and found a great dataset on Kaggle – you can get the data here.
Getting the data ready
I imported the data into SAS with the Import task, and had no issues. It has just over 24,000 rows and so will give us lots of data to play with.
The dataset is fairly straightforward – Party, Candidate, Votes, State, and County. So the first thing I’m going to do is create a simple bar chart looking at Party and the total number of votes by party and candidate.
When I run this, I find out a) I’d forgotten (or didn’t know) that some of these people ran and b) Republicans had significantly more candidates than the Democrats.
I don’t know if that’s a normal trend, if that’s the GOP way, or some other anomaly for this election, but as a data analyst, that’s interesting to me. It looks as though in the primaries, more people were hoping for a Republican outcome, but couldn’t agree on a single candidate.
Next, I want to see how the final two candidates fared; to do this I write a simple SQL query:
proc sql; select candidate, state, sum(votes) from work.import where candidate in ('Donald Trump', 'Hillary Clinton') group by candidate, state order by candidate, state
I’m selecting the variables I’m interested in, but limiting my results to just Trump and Clinton. As in previous articles on SQL, the Group By and Order By are needed when doing aggregate (summary) statistics like Sum.
In the above image, I’ve scrolled down part way to show you that although this is informative, it’s not easy to make comparisons. Ideally, we would have Clinton and Trump side-by-side, allowing for a more efficient review.
In order to do this, we have to make our SQL query slightly more complicated. I’m again selecting State, but this time I am doing a Sum based on specific criteria – that the Candidate is equal to ‘Donald Trump’ in the first case or equal to ‘Hillary Clinton’ in the second. I then do a sum of all votes.
proc sql; select state, sum(case when (candidate = 'Donald Trump') then votes end) as Trump, sum(case when (candidate = 'Hillary Clinton') then votes end) as Clinton, sum(votes) as Total from work.import group by state order by state; quit;
Now when I run my query, I get a much easier table to review and I can clearly see where there are significant differences.
I may want to look into Colorado, however, as it does appear to be odd that 0 votes in the entire state were for Trump. But this still isn’t quite ideal, as having just straight numbers makes comparisons possible, but not easy.
My final query is to take the counts, but also to bring in percentages – this will give a very clear picture of how everything fit together.
proc sql; select state, sum(case when (candidate = 'Donald Trump') then votes end) as Trump, sum(case when (candidate = 'Hillary Clinton') then votes end) as Clinton, sum(votes) as total, 100*sum(case when (candidate = 'Donald Trump') then votes end) / sum(votes) as Trump_Perc, 100*sum(case when (candidate = 'Hillary Clinton') then votes end) / sum(votes) as Clinton_Perc from work.import group by state order by state; quit;
You’ll notice that I’m keeping in the raw counts, as I like to see those, but I’m also calculating the percentages and calling them Clinton_Perc and Trump_Perc. These are the only two additions from the code above.
Now when I run this, I get this table:
Very easy to read, making comparisons by percentages simply a matter of going row by row.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.