SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Every year thousands of people die in workplace incidents. The type of incident varies enormously – exposure to dangerous chemicals, road traffic accidents, assaults and so on. In this edition of Free Data Friday, we will be looking at Federal data to discover not only America’s most dangerous jobs but also its least dangerous and you may be pleasantly surprised by the conclusion from that second analysis….
You can get the data from the Bureau of Labor Statistics (BLS) web site by downloading it in XLSX format
Although XLSX files are easy to read into SAS using the XLSX Libname engine the layout of the downloaded file posed significant problems. You can see from the screen shot below that multiple tables are included on the same sheet and additionally the table that we will be using includes totals and partial splits of the totals which don’t add up to the totals themselves. In order to make the process easier I edited the spreadsheet to remove tables I didn’t want, delete rows for occupations other than Major Groups defined by in the Standard Occupational Classification codes (codes ending 0000) and also edited the column titles. The final sheet I used is an attachment to this article.
As I mentioned in the introduction, I want to discover both the most and least dangerous jobs and I also want to take into account the fact that some jobs have more workers than others and also work more hours than others. Obviously, the fatalities variable gives me the absolute number of fatalities for each occupational group while fatal_injury_rate is defined by the BLS as the number of fatalities per 100,000 full-time equivalent workers and is their recommended metric for making these comparisons.
Having identified the variables I want to use I first wrote a simple data step to strip out any residual spaces which may exist at the beginning and end of the occupation variable caused by formatting in the spreadsheet. Then, in order to get the top ten for each category I could sort the data set by each variable in turn by ascending then descending values. Considering that I have four variables that’s a lot of sorting. There is an alternative however – the Rank Procedure which allows the creation of rank values for multiple variables in one procedure call. For each call the ranked order must be for either ascending or descending values but it still only requires two calls as opposed to eight separate sorts. Here is the code run so far:
data fatality_rates;
set xl.cfoi_rates_2017hb;
occupation=strip(occupation);
run;
proc rank data=fatality_rates out=most_rankings descending ties=low;
var fatal_injury_rate fatalities;
ranks rate_rank fatality_rank;
run;
proc rank data=fatality_rates out=least_rankings ties=low;
var fatal_injury_rate fatalities;
ranks rate_rank fatality_rank;
run;
The output data sets look like this:
Now that I’ve got variables which allow me to extract the top ten for each category, I can create bar charts to display my results. In this case I’ve chosen horizontal bar charts as I feel that’s a better choice when displaying top n values in order. I’ve also added titles, a footnote giving the data source and some styling.
Here’s the code and output for the top ten most dangerous jobs, both by number of fatalities and fatality rate.
ods graphics / reset width=8in height=8in imagemap;
title1 'USA Fatal Injury Rate by Occupation';
title2 'Number of Fatalities';
title3 'Top Ten by Number of Fatalities';
footnote j=l 'Data from the Bureau of Labor Statistics';
proc sgplot data=most_rankings (where=(rate_rank<=10));
styleattrs backcolor=lightblue wallcolor=lightblue;
hbar Occupation / response=fatalities fillattrs=(color=gold) categoryorder=respdesc;
yaxis grid label="Occupation";
xaxis grid label="Number of Fatalities";
run;
ods graphics / reset width=8in height=8in imagemap;
title1 'USA Fatal Injury Rate by Occupation';
title2 'Rate per 100,000 Full Time Workers';
title3 'Top Ten Rates';
footnote j=l 'Data from the Bureau of Labor Statistics';
proc sgplot data=most_rankings (where=(rate_rank<=10));
styleattrs backcolor=lightblue wallcolor=lightblue;
hbar Occupation / response=Fatal_Injury_rate fillattrs=(color=gold) categoryorder=respdesc;
yaxis grid label="Occupation";
xaxis grid label="Fatal Injury Rate";
run;
We can see that while transportation and construction are at the top of the list in terms of absolute number of fatalities when we look at the fatality rate then farming, fishing and forestry occupations leapfrog them into top spot indicating it’s ranking as the most dangerous occupation in the USA.
As mentioned in the introduction though I also wanted to find the least dangerous jobs in America – here’s the code and output for those graphs:
ods graphics / reset width=8in height=8in imagemap;
title1 'USA Fatal Injury Rate by Occupation';
title2 'Number of Fatalities';
title3 'Top Ten by Least Number of Fatalities';
footnote j=l 'Data from the Bureau of Labor Statistics';
proc sgplot data=least_rankings (where=(rate_rank<=10));
styleattrs backcolor=lightblue wallcolor=lightblue;
hbar Occupation / response=fatalities fillattrs=(color=gold) categoryorder=respasc;
yaxis grid label="Occupation";
xaxis grid label="Number of Fatalities";
run;
ods graphics / reset width=8in height=8in imagemap;
title1 'USA Fatal Injury Rate by Occupation';
title2 'Rate per 100,000 Full Time Workers';
title3 'Top Ten by Lowest Rates';
footnote j=l 'Data from the Bureau of Labor Statistics';
proc sgplot data=least_rankings (where=(rate_rank<=10));
styleattrs backcolor=lightblue wallcolor=lightblue;
hbar Occupation / response=Fatal_Injury_rate fillattrs=(color=gold) categoryorder=respasc;
yaxis grid label="Occupation";
xaxis grid label="Fatal Injury Rate";
run;
While computer and mathematical occupations and legal occupations are tied for first place in number of fatalities, when you then look at the fatality rate, computer and mathematical occupations are the clear leader as the safest jobs in the country.
Of course, this includes SAS professionals so, if you’re reading this article you can relax safe in the knowledge that you work in one of the safest jobs in the country!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
You could do the rankings in a single pass with no explicit sorting using hash objects or arrays in a data step. I wonder which of the three methods would be faster. Hash objects might be very fast with the right programming.
Hi Jack - you may very well be right - I've used hash objects a lot for merging very large files and my experience is they're super efficient for that, but the merge statement wins when smaller files are used. I suspect it might be the same with sorting/ranking but I'd need to do some experiments to prove that. Meanwhile Proc Rank has the benefit of simplicity 🙂
Thanks for posting this. Interesting stuff here, nicely explained. Of course, most programming is done sitting, as is most typing. At about 1,800 work hours per year, that's potentially over 70,000 sedentary hours in a 40 year career, which can be lethal in other ways. But that's a subject for a different post 🙂
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.