
In the Time of Cholera: Examining Correlations and Fit Lines with Drinking Water Data

Started ‎07-19-2018 by
Modified ‎07-19-2018 by

I fear that I have been muddying the waters for some of my students when I discuss Fit Lines and Correlation Matrices. So I decided to use some data on clean drinking water to try to clear up any confusion.

 

I was recently talking to a friend about the novel Love in the Time of Cholera by Gabriel Garcia Marquez. The title implies that cholera is a thing of the past…the novel takes place in the late 1800s/early 1900s. But here we are in 2018, and I was amazed to learn that cholera still causes hundreds of thousands of deaths annually.

 

1.png (Source: https://data.unicef.org/topic/child-health/diarrhoeal-disease/)

 

The water-borne bacterium Vibrio cholerae is the sneaky little culprit that causes cholera. Drinking water contaminated with the bacterium can cause severe diarrhea and vomiting, leading to dehydration and shock. If left unchecked, this dehydration can cause some people to die within hours. Because cholera is carried in contaminated water, trying to rehydrate with more of that same contaminated water creates a vicious cycle.

 

It was shocking to me to learn that diarrhea remains a leading cause of death among young children. Access to safe drinking water is the main way to prevent these deaths.

 

2.png

 

In places where drinking water and sewage are separated and treated, cholera outbreaks don’t happen. Vibrio cholerae can also be killed by boiling. Because dehydration is ultimately the cause of death, lives can also be saved by rehydrating with clean, electrolyte-laden water or beverages. Unfortunately, natural disasters like hurricanes, floods, and earthquakes, as well as human-created disasters like wars, can disrupt clean drinking water supplies, even in areas where clean drinking water infrastructure was previously fully functional.

 

Exploring Drinking Water Data

Let’s look at some drinking water and diarrhea data in Visual Analytics 8.2 on Viya 3.3. I use two data sets.  The first data set is from the World Health Organization/UNICEF Joint Monitoring Programme Global Database. The second is from the UNICEF Diarrhoeal Disease database.

 

3.jpg

 

Fit lines are available in SAS Visual Analytics from either a scatter plot or a heat map. Recall that a scatter plot is simply a plot of all of the observations for two variables: one variable is on the horizontal axis (x-axis) and the other is on the vertical axis (y-axis). If you have too many observations, the scatter plot will not render because the points would be plotted on top of one another. Instead, a heat map can be used to visualize the data.
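To make the heat map idea concrete, here is a minimal Python sketch of the general technique: divide the x-y plane into a grid of cells and count the observations in each cell. The data are randomly generated for illustration (they are not the drinking water data), the grid size of 20 is an arbitrary choice, and this is not a description of how Visual Analytics builds its heat maps internally.

```python
import numpy as np

# Synthetic data for illustration only (not the article's data).
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=50.0, scale=10.0, size=100_000)
y = 0.5 * x + rng.normal(loc=0.0, scale=5.0, size=100_000)

# Count the observations that fall in each cell of a 20 x 20 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20)

print(counts.shape)       # one cell per (x bin, y bin) pair
print(int(counts.sum()))  # every observation lands in exactly one cell
```

Coloring each cell by its count gives a picture that stays readable no matter how many points overlap.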

 

Using the drinking water data, I have taken a subset of countries (those on the African continent), and created a scatter plot of the percent of unimproved drinking water by the country’s population. With either a heat map or a scatter plot in Visual Analytics 8.2, you can open your Options pane and under Fit Line select None, Best Fit, Linear Fit, Quadratic, Cubic or PSpline as shown in the screen capture below.

 

4.png

 

Let me remind you of three types of fit lines that Visual Analytics 8.2 can create (we will save P-splines for another discussion and will not address those here):

 

Linear   y = β₀ + β₁x    (A straight line, where β₀ is the y-intercept and β₁ is the slope of the line)

 

We can see below that the model type is "Linear" and the equation of the line is f(x) = 12.4041 + 0.0766x. Here f(x) is the dependent variable, which we also call y; 12.4041 is the y-intercept, which we can also approximate visually on the graph; and 0.0766 is the slope.
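As a quick check on how to read that equation, the sketch below generates points that lie exactly on the line f(x) = 12.4041 + 0.0766x and then recovers the intercept and slope with NumPy's least-squares polynomial fit. The x values are made up for illustration and are not the population figures behind the plot.

```python
import numpy as np

# Hypothetical x values (not the article's data); y is generated exactly
# from the fit line quoted above, so the fit should recover that line.
x = np.array([0.0, 10.0, 50.0, 100.0, 180.0])
y = 12.4041 + 0.0766 * x

# np.polyfit returns coefficients from the highest degree to the lowest,
# so for degree 1 the order is (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)

def f(x_value):
    """Predicted y for a given x, using the fitted line."""
    return intercept + slope * x_value

print(round(intercept, 4), round(slope, 4))
print(round(f(100.0), 4))
```

Each one-unit increase in x raises the predicted value by the slope, and the prediction at x = 0 is the intercept.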

 

5.png

 

Quadratic   y = β₀ + β₁x + β₂x²   (A curved line with one bend, that is, a single turning point)

 

6.png

 

Cubic    y = β₀ + β₁x + β₂x² + β₃x³  (A curved line with up to two bends and one point of inflection, “S-shape”)

 

7.png
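One way to see how the quadratic and cubic shapes differ is to count the places where each curve changes direction, which is where its slope (first derivative) equals zero. The sketch below does this with NumPy's Polynomial class. The coefficients are hypothetical values chosen for easy arithmetic; they are not the fitted values from the plots above.

```python
from numpy.polynomial import Polynomial

# Hypothetical coefficients, listed in the order β0, β1, β2, (β3).
quadratic = Polynomial([1.0, -2.0, 0.5])    # 1 - 2x + 0.5x^2
cubic = Polynomial([0.0, -3.0, 0.0, 1.0])   # -3x + x^3

# A curve changes direction where its first derivative is zero.
quadratic_bends = quadratic.deriv().roots()
cubic_bends = cubic.deriv().roots()

print(quadratic_bends)  # a single x value
print(cubic_bends)      # two x values
```

A quadratic always has exactly one such point. A cubic has at most two, and some cubics (y = x³ + x, for example) have none.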

 

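A completely separate concept is how well the model fits the data. The R-square measures the goodness of fit of the model, and ranges from 0 (doesn’t fit well) to 1 (fits perfectly). It is calculated as one minus the sum of squares of the residuals divided by the total sum of squares. For a least-squares fit with an intercept, this equals the sum of squares explained by the model divided by the total sum of squares: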

 

8.jpg

 

where y-hat is the predicted value, y-bar is the mean, and y is an actual observation.
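Because the formula image is not reproduced in this text, here is the standard textbook definition written with the same symbols. It may differ in layout from the original figure. The second equality holds for a least-squares fit that includes an intercept.

```latex
R^2 \;=\; 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
    \;=\; \frac{\sum_{i} (\hat{y}_i - \bar{y})^2}{\sum_{i} (y_i - \bar{y})^2}
```

The numerator of the first fraction is the residual sum of squares, and the shared denominator is the total sum of squares.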

 

The Pearson Correlation Coefficient (r) is related to the R-square of a linear fit line, but apart from sharing the sign of the slope, it tells you nothing about the fit line equation itself. The Pearson Correlation Coefficient r ranges from –1 to 1:

  • 0 means not correlated at all
  • 1 means perfectly positively correlated
  • -1 means perfectly negatively correlated

In the following examples you see that although the fit lines are different (notice the very different slopes), in both cases the correlation coefficient = 1. That is, in both cases the x and the y variables are perfectly correlated (r=1), i.e., the data lie exactly on the model.

 

9.jpg Different fit line equations (notice the obviously different slopes), but the same correlation
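The same point can be checked numerically. In this small sketch, two sets of y values lie exactly on lines with very different slopes, and both have a correlation of 1 with x. The x values and slopes are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical x values
y_shallow = 1.0 + 0.5 * x                 # points exactly on a line with slope 0.5
y_steep = 1.0 + 4.0 * x                   # points exactly on a line with slope 4.0

# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is r.
r_shallow = np.corrcoef(x, y_shallow)[0, 1]
r_steep = np.corrcoef(x, y_steep)[0, 1]

print(round(r_shallow, 6), round(r_steep, 6))
```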

 

Below we see the difference among positive correlation, negative (inverse) correlation and no correlation.

 

10.jpg

 

Below we see a strong positive correlation (r=0.7) versus a weak positive correlation (r=0.3). Notice that the equation for the line would be the same! But the tightness of the fit around the line is different.

 

11.jpg

 

Below we see a strong negative correlation (r=–0.7) versus a weak negative correlation (r=–0.3). Again, the equation for these two lines would be the same, but the tightness of the fit around the line is different.

 

12.jpg
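Here is a small sketch of the "same line, different tightness" idea. The scatter values are hand-picked (an assumption made purely for illustration) so that they sum to zero and are uncorrelated with x. That makes the least-squares line identical in both cases, while the amount of scatter, and therefore r, differs.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# Hand-picked scatter: it sums to zero and is uncorrelated with x, so adding
# any multiple of it leaves the least-squares line unchanged.
scatter = np.array([1.0, -2.0, 1.0, 1.0, -2.0, 1.0])

y_tight = 2.0 + 0.5 * x + 0.6 * scatter   # points close to the line
y_loose = 2.0 + 0.5 * x + 1.9 * scatter   # points far from the line

slope_tight, intercept_tight = np.polyfit(x, y_tight, deg=1)
slope_loose, intercept_loose = np.polyfit(x, y_loose, deg=1)
r_tight = np.corrcoef(x, y_tight)[0, 1]
r_loose = np.corrcoef(x, y_loose)[0, 1]

print(round(intercept_tight, 4), round(slope_tight, 4), round(r_tight, 2))
print(round(intercept_loose, 4), round(slope_loose, 4), round(r_loose, 2))
```

Replacing the slope 0.5 with -0.5 flips the sign of both correlations while leaving their magnitudes unchanged, which is the negative-correlation case.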

 

The Pearson correlation coefficient is calculated as follows:

 

13.jpg
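For readers who prefer code to formulas, here is a minimal sketch of the standard textbook calculation: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the five data points are my own, chosen for easy arithmetic. The result is compared against NumPy's built-in np.corrcoef.

```python
import math
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up values for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4))  # 0.8
```

Here the deviation products sum to 8 and each sum of squared deviations is 10, so r = 8 / 10 = 0.8.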

 

View more information on calculating the Pearson correlation coefficient.

 

For our linear fit line, which is a simple linear regression with an intercept and just a single input variable, the R-square is the square of the correlation coefficient (~0.46 squared ≈ 0.212). We can see this below when we open the details view of the scatter plot. Notice that the correlation coefficient has been rounded in the analysis table.
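This relationship is easy to verify numerically. The sketch below fits a straight line to randomly generated data (not the drinking water data), computes the R-square from the residuals, and compares it with the squared Pearson correlation. It also repeats the arithmetic quoted above.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(seed=1)
x = rng.uniform(0.0, 100.0, size=50)
y = 10.0 + 0.1 * x + rng.normal(0.0, 5.0, size=50)

# Least-squares straight line with an intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]

print(bool(np.isclose(r ** 2, r_squared)))  # True
print(round(0.46 ** 2, 4))                  # 0.2116
```

The equality only holds for a straight-line fit with an intercept and a single input variable. It does not carry over to the quadratic or cubic fits.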

 

14.png

 

We can also see this correlation of 0.46 using a Correlation Matrix object.

 

15.png
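A correlation matrix is just the Pearson correlation computed for every pair of measures. As a rough sketch of what the object displays, the code below builds one with NumPy from three randomly generated measures. The names measure_a, measure_b, and measure_c are placeholders, and none of the numbers come from the article's data.

```python
import numpy as np

# Three hypothetical measures, generated at random for illustration.
rng = np.random.default_rng(seed=2)
n = 200
measure_a = rng.uniform(1.0, 200.0, size=n)
measure_b = 10.0 + 0.05 * measure_a + rng.normal(0.0, 8.0, size=n)
measure_c = 0.1 * measure_b + rng.normal(0.0, 1.0, size=n)

# Each row passed to np.corrcoef is treated as one variable.
corr = np.corrcoef([measure_a, measure_b, measure_c])

print(np.round(corr, 2))
```

The matrix is symmetric, and its diagonal is all ones because every measure is perfectly correlated with itself.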

 

For the quadratic fit line, we see an improved R-square of 0.4484, meaning the quadratic equation fits the data somewhat better. But visually we can see that just 3 data points are very influential in driving that model. So we don’t necessarily want to bet the farm on it.

 

16.png

 

Finally, we see that the cubic line has a yet higher R-square of 0.4755. But again, given that the data include only three countries with high populations, we want to be careful about overfitting our data.
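A rising R-square is not, by itself, evidence that the more complex curve is the better model. In the sketch below the data are generated from a straight line plus random noise, with only three points at large x values (a made-up setup that loosely mimics the situation above, not the actual data). Because each higher-degree polynomial contains the lower-degree one as a special case, the R-square can only stay the same or go up as terms are added.

```python
import numpy as np

# Synthetic data: a truly linear relationship plus noise, with 30 points at
# small x and only 3 points at very large x.
rng = np.random.default_rng(seed=3)
x = np.concatenate([rng.uniform(0.0, 40.0, size=30), [120.0, 150.0, 180.0]])
y = 12.0 + 0.08 * x + rng.normal(0.0, 6.0, size=x.size)

def r_squared(x, y, degree):
    """R-square of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_by_degree = {degree: r_squared(x, y, degree) for degree in (1, 2, 3)}

for degree, value in r2_by_degree.items():
    print(degree, round(value, 4))
```

Any improvement from the quadratic and cubic terms here comes from fitting the noise, since the underlying relationship is linear by construction.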

 

17.png

 

So I promised you we would talk about diarrhea. Hopefully, you are not approaching your lunch break. A correlation matrix shows that the percent Unimproved Drinking Water has a strong correlation (0.6342) with the Diarrhea Death Rate for children under age 5.

 

18new.png

 

We can also look at the detailed view of a scatterplot as shown in the graph below that depicts the diarrhea death rate (per 1000 children under age 5) by the percent of unimproved drinking water.

 

19_new.png

 

This positive correlation corroborates what we have read in the literature, that unimproved drinking water is correlated with diarrhea deaths in children under age five.

 

The Good News

The good news is that access to clean drinking water has improved in many places since 1900. Below, from the WHO/UNICEF Joint Monitoring Programme 2017 report, we can see that worldwide, 71 percent of people use safely managed drinking water.

 

20.png

 

But that means 29 percent could be better. Hopefully, this will continue to improve if we adult humans decide to make clean drinking water for every child in the world a priority.

 

References and Data Sources
