01-05-2016 09:19 PM - edited 01-05-2016 09:21 PM
Soft Skills Part 3 – Data Visualization, Tables and Graphics
There is a lot of information about proper techniques in data visualization; people such as Edward Tufte have devoted their lives to educating others on proper techniques for visualizing and displaying data. This article is a brief summary of my thoughts and feelings on data visualization; the next will be some tips and tricks I’ve learnt in SAS.
Data visualization is one of the areas in analytics where misrepresentation can be easily introduced, whether through a lack of understanding or through intentional manipulation of the data. Using an inappropriate type of graph, not having labels, or not providing an accurate description of the visualization (either through the title or through notes added to the graph) are all examples of ways a graph can be mishandled.
Being partially colour-blind, I find it hard to distinguish certain colours, especially similar shades of the same colour. Shades of grey are also problematic for me, so I occasionally struggle to tell the different groups apart in graphs in journal articles or other publications.
The image above is an example colour palette; I’ve highlighted a couple of sections where I personally find it challenging or impossible to tell the colours apart. Colours from different columns are easier to distinguish than colours from the same column. If the paper or presentation is being prepared for a black-and-white publication, or you assume the reader will print the graph(s) in black-and-white, patterns are easier for some people to tell apart.
Examples of patterns from Microsoft are shown below. Even on a printout, the groups would be easily distinguishable, and so patterns may be preferable. They are particularly useful when combined with different colours as above, which allows people who are colour-blind, people printing in black and white, and people in the audience to easily tell the difference between groups.
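The idea of pairing every group with both a colour and a pattern can be sketched in a few lines. This is only an illustration in Python, not SAS; the function name `assign_styles` is hypothetical, the hex codes are the Okabe-Ito palette (a commonly cited colour-blind-friendly set, my assumption and not something taken from the palette image above), and the pattern strings follow the hatch notation matplotlib uses.

```python
# Okabe-Ito palette: an assumed colour-blind-friendly set of hex colours.
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

# Fill patterns written in matplotlib-style hatch notation.
HATCHES = ["/", "\\", "x", ".", "+", "o", "-", "|"]


def assign_styles(groups):
    """Give each distinct group its own colour AND its own pattern.

    Because no colour or pattern is reused, groups stay distinguishable
    even if the colours are lost (black-and-white printing) or are hard
    for a colour-blind reader to separate.
    """
    unique = list(dict.fromkeys(groups))  # drop duplicates, keep order
    if len(unique) > len(OKABE_ITO):
        raise ValueError("more groups than distinct colour/pattern pairs")
    return {
        group: {"color": colour, "hatch": hatch}
        for group, colour, hatch in zip(unique, OKABE_ITO, HATCHES)
    }
```

Each resulting dictionary could then be handed to whatever plotting tool you use. The design choice is to refuse more than eight groups rather than silently recycle a colour or pattern, since a repeated style defeats the purpose.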
Identifying and labelling data points in a graph, or other identification features on the visualization (titles, legends, etc.), is more art than science. Labelling individual data points on a scatter plot where the labels overlap each other, and/or are obscured by the data themselves, can cause a critical and well-designed graph to become unreadable. If my scatter plot has a lot of data, but I need to show point-specific information, I focus on the outliers. Depending on the graph, I’ll label only the top / bottom 10% of points, the top / bottom 10 items, or the points furthest from the mean in standard deviations. If I get a highly complex visualization or one with a lot of information, I run multiple iterations of the visualization – I play around with the labels, data point size, etc. to make sure I’m providing the absolute best graph possible. Obviously this means that your preparation will take longer, and in a future post I’ll go through some of my thoughts on time management to get you to be the most efficient data analyst possible.
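As a rough illustration of labelling only the outliers, here is a minimal sketch in Python (not SAS). The function names, the `(name, value)` input format, and the default thresholds are all hypothetical choices for the example; the right cut-off depends on the graph.

```python
from statistics import mean, stdev


def labels_by_fraction(points, fraction=0.10):
    """Names of the lowest and highest `fraction` of points by value.

    `points` is a list of (name, value) pairs. At least one point is
    taken from each end.
    """
    ranked = sorted(points, key=lambda p: p[1])
    n = max(1, round(len(ranked) * fraction))
    if 2 * n >= len(ranked):
        return {name for name, _ in ranked}  # small data: label everything
    return {name for name, _ in ranked[:n] + ranked[-n:]}


def labels_by_sd(points, k=2.0):
    """Names of points lying more than k standard deviations from the mean.

    Needs at least two points, because a sample standard deviation is
    undefined for a single value.
    """
    values = [value for _, value in points]
    centre, spread = mean(values), stdev(values)
    return {name for name, value in points if abs(value - centre) > k * spread}
```

Only the names returned by one of these functions would get a label on the scatter plot; every other point is drawn unlabelled, which keeps the labels from overlapping each other or hiding the data.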
Graph selection is a highly contested topic that people are passionate about – more than once I’ve been in meetings where there was a very heated argument about the graphs that would be needed. I realise not everyone will agree with my opinions, and I would appreciate any (polite) discussion about the topic.
My friend Peter Flom has an article (http://www.statisticalanalysisconsulting.com/graphics-for-univariate-data-pie-is-delicious-but-not-n...) that I love, and refer people to often. Human eyes are not designed to detect differences between shapes that are not uniform; although there is a lot of great research available, the article here is particularly relevant and I draw your attention to Section 35.2, focussing on pie graphs. Box plots, bar charts, and scatter plots are more effective ways to display data, and Peter’s article has great examples comparing the different types of graphs using the same data.
I know people use 3D graphs for a variety of reasons; I have a friend in finance who uses them for every presentation. Like pie graphs, 3D charts make visual comparisons extremely difficult – especially when the reference lines are not horizontal / vertical. One of the most difficult graphs I’ve seen in a presentation was an exploding 3D pie chart, where each segment was then sectioned into different categories (for example, one slice was “Shoes”, then a section within the slice for red, brown, blue, etc.). I don’t know how the person even made this graph, but it was impossible to interpret, and therefore distracted the audience from any point (valid or not) that was being made.
As someone who uses SAS, interpretation, analysis and presentation of data are integral parts of your job. Presenting your data and your analyses accurately, completely and appropriately is not just part of your job – it is your job. I will make the same recommendation as I did in my article on Integrity – find someone you can go to when you have a question, to review graphs, to bounce ideas off of. There are online communities (here, for one!) that have people more than willing to help – and I recommend you use them!
My next article will have specific examples of good and not-so-good graphs, and some graphing tricks I've picked up along the way. Until then, happy SAS-ing!
01-06-2016 10:06 AM