Correlation analytics in SAS - Background and some theory
Numeric variables in a dataset are often correlated with each other. Some correlations are strong, some weak, some negative and others positive. This is the core of it, i.e. the degree of the relationship between the variables and its direction.
With SAS it is easy to do correlation analytics using Proc Corr. This procedure can produce a lot of output, but here we focus on the Pearson correlation coefficients (PCC) and the matrix plots.
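For reference, the Pearson correlation coefficient for n paired observations (x_i, y_i) is usually written as below. The symbols are introduced here for illustration and do not come from the original text.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value always lies between -1 and 1: the sign gives the direction of the relationship and the magnitude gives its strength.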
In SAS, the practical tool for correlation analysis is Proc Corr. It is a Base SAS procedure, so it should run on all SAS installations. There are some requirements for a correlation analysis to be statistically sound. A few of them are:
Two or more numeric variables that are continuous, i.e. not discrete like a variable having only ON or OFF, 1 or 0.
The variables should have non-missing values (Proc Corr can omit observations with missing values using the option NOMISS).
The variables should be independent of each other, i.e. not a function of each other like A=B+2, and should be distributed in a similar manner.
For Proc Corr, each variable should be a column with its values in rows, and the data may also have an ID variable that can be used as a by-variable. Two columns is the minimum, for example when the correlation between the variables height and weight is calculated. Therefore, in some cases it is necessary to transpose the data to get it into the format below, as is done in the GMO Vs Pirate example further down. A by-variable, like gender, can be added and used in the by-statement in Proc Corr.
height   weight
180      70
165      58
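As a minimal sketch of this layout, the code below builds a small table with one column per variable plus a gender column for by-group processing. The dataset name work.body, the gender values and all rows other than the two shown above are made up for illustration.

```sas
* Hypothetical example data: one row per person, one column per variable ;
data work.body;
  input gender $ height weight;
  datalines;
F 165 58
F 170 63
F 158 52
M 180 70
M 175 77
M 188 82
;
run;

* BY-group processing requires the data to be sorted by the BY variable ;
proc sort data=work.body;
  by gender;
run;

* One correlation table per gender. NOMISS drops rows with any missing value ;
proc corr data=work.body nomiss;
  by gender;
  var height weight;
run;
```

Without the by-statement, the same step would give a single correlation table for all rows.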
Correlation between numbers does not mean causation
First, it is important to remember that even a strong statistical correlation between two variables does not imply that one variable causes or explains the other. In other words, strong correlation does not mean causation.
A fun example is the strong correlation between planted genetically modified cotton in Texas and global pirate attacks, see image below.
Here, the correlation is as strong as 0.948 so this must be the truth 🙂
To see the graph, data and an AI explanation of the correlation (!), go to: https://www.tylervigen.com/spurious/correlation/1746_gmo-use-in-cotton-in-texas_correlates-with_pirate-attacks-globally
More examples at: https://www.tylervigen.com/spurious-correlations
Note that the page has a good section explaining why this turns out as a perfectly good correlation (ref. the requirements above) and how to reproduce it yourself.
The SAS code to reproduce this example is below:
** Test data for GMO Vs. Pirate Attacks ;;
** Data from: https://www.tylervigen.com/spurious/correlation/1746_gmo-use-in-cotton-in-texas_correlates-with_pirate-attacks-globally ;;
Options validvarname=v7;
data work.gmo_pirates;
infile cards dlm=" ";
length var $30 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 Y2016 Y2017 Y2018 Y2019 Y2020 Y2021 Y2022 8 ;
input var Y2009-Y2022;
cards;
GMO_cotton_Texas 15 13 18 11 8 4 5 4 5 3 3 5 2 2
Pirate_attacks_globally 410 445 439 297 264 245 246 191 180 201 162 195 132 115
;;
run;
** Add a variable LABEL_VAR that holds a descriptive label for each value of VAR ;;
data work.gmo_pirates;
length label_var $40;
set work.gmo_pirates;
if var = "GMO_cotton_Texas" then label_var = "Planted GMO cotton % in Texas";
if var = "Pirate_attacks_globally" then label_var = "Number of Pirate Attacks";
run;
** Transpose the data so that each variable becomes a column with its values in rows. NAME=year stores the former column names (Y2009 etc.) in a variable called year, and the ID variable label_var names the new columns ;;
proc transpose data=work.gmo_pirates out=work.gmo_pirates_trans delimiter=_ name=year label=var ; * prefix=var_;
*by var ; ** NOTE: Using the option NOTSORTED will not transpose this in the same way as if it is sorted..;;
id label_var ;
*var <variables>;
run;
PROC CORR DATA=work.gmo_pirates_trans PLOTS=matrix(Histogram NVAR=all) PLOTS(MAXPOINTS=50000 ) NoMiss;
VAR _numeric_;
RUN;
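If the coefficient is needed as data rather than only as printed output, Proc Corr can write it to a dataset with the OUTP= option. A small sketch, assuming work.gmo_pirates_trans was created by the code above. The output dataset name work.gmo_corr is made up here.

```sas
* Write the Pearson statistics to a dataset instead of printing them ;
proc corr data=work.gmo_pirates_trans outp=work.gmo_corr noprint nomiss;
  var _numeric_;
run;

* Show only the correlation rows (the dataset also holds MEAN, STD and N) ;
proc print data=work.gmo_corr noobs;
  where _type_ = "CORR";
run;
```

The printed rows form the same symmetric matrix that Proc Corr shows in its normal output.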
Checking for the causality problem is important
To mitigate the causality problem, it is necessary to test the relationship and seek evidence for it via other hypotheses and variables. This can be done by checking whether other variables explain the case as well as, or better than, the correlation found.
For example, there could be a correlation between exercising and skin cancer. The third variable here could be exposure to sunlight, which can be shown to be a cause of skin cancer. In that case, the correlation between exercising and skin cancer could be due to doing much of the exercise in sunlight (see this post about correlation Vs causation: https://www.jmp.com/en_au/statistics-knowledge-portal/what-is-correlation/correlation-vs-causation.html).
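One way to check a suspected third variable in SAS is the PARTIAL statement in Proc Corr, which computes the correlation after the linear effect of the listed variables has been removed. The sketch below uses simulated data, not real measurements: the dataset work.confound, the variable names and the coefficients are all assumptions made for illustration.

```sas
* Simulated data: sun exposure drives both exercise and cancer risk ;
data work.confound;
  call streaminit(2024);
  do id = 1 to 500;
    sun_exposure = rand("normal", 10, 2);
    exercise     = 0.8 * sun_exposure + rand("normal", 0, 1);
    cancer_risk  = 0.5 * sun_exposure + rand("normal", 0, 1);
    output;
  end;
run;

* Plain correlation: exercise and cancer risk appear related ;
proc corr data=work.confound;
  var exercise cancer_risk;
run;

* Partial correlation: the same pair, controlling for sun exposure ;
proc corr data=work.confound;
  var exercise cancer_risk;
  partial sun_exposure;
run;
```

In this simulation the first coefficient should be clearly positive, while the partial coefficient should come out close to zero, because the two variables are only linked through sun_exposure.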
Example of how to do correlation analysis using Proc Corr
To get a good view of the correlation, it is useful to see both the Pearson correlation coefficients and a scatter plot for each pair of variables. To do this, we need to add options like:
PLOTS=matrix(Histogram NVAR=all) PLOTS(MAXPOINTS=50000 ) NoMiss;
This means that we request a scatter plot matrix, with a histogram for each variable, for all variables in the var-statement. Further, we override the default maximum of 5000 points and use 50000 instead. NoMiss excludes observations with missing values.
The SAS code then becomes simple:
libname sample "<path to where dataset sample_dataset_2014>";
** Standard proc corr showing Pearson correlation values and scatter plots and histograms ;;
ods graphics on;
PROC CORR DATA=sample.sample_dataset_2014 PLOTS=matrix(Histogram NVAR=all) PLOTS(MAXPOINTS=50000 ) NoMiss;
VAR weight height gender smoking sprint MileMinDur Math Reading SleepTime;
RUN;
The result looks like this:
Pearson correlation coefficients (PCC) below. Note that the matrix is symmetric, so each coefficient appears twice:
The PCC value needed to represent no/weak/strong correlation differs a bit from source to source. A negative sign means negative/opposite correlation. In absolute value, 0 to 0.3/0.4 is a weak correlation, around 0.3-0.6/0.7 is a medium correlation and 0.6/0.7-1.0 is a strong correlation. Some require 0.8 for a strong correlation.
In our case, there is a strong correlation between the short sprint (35 meters) and MileMinDur (a run of 1 mile, about 1609 meters) with a PCC of 0.71.
There is also a good correlation between height and weight (0.57) and between gender and MileMinDur (0.48), but less between gender and sprint (0.29). Note that gender is not continuous, so this may be a false correlation. Interestingly, Math and Reading are relatively strongly correlated (0.49).
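The strength bands described above can also be applied to the output programmatically. The sketch below is one possible approach, not part of the original analysis: it assumes the cut-offs 0.3 and 0.7 (one choice within the ranges mentioned), the sample library defined earlier, and the made-up names work.pcc, work.pcc_long and pccband.

```sas
* User-defined format that maps the absolute PCC to a verbal strength ;
proc format;
  value pccband
    0   -< 0.3 = "weak"
    0.3 -< 0.7 = "medium"
    0.7 -  1   = "strong";
run;

* Store the correlation matrix in a dataset ;
proc corr data=sample.sample_dataset_2014 outp=work.pcc noprint nomiss;
  var weight height gender smoking sprint MileMinDur Math Reading SleepTime;
run;

* Turn the matrix into one row per variable pair and classify it ;
data work.pcc_long (keep=var1 var2 pcc strength);
  set work.pcc (where=(_type_ = "CORR"));
  array v{*} _numeric_;            /* the analysis variables only */
  length var1 var2 $32 strength $8;
  var1 = _name_;
  do i = 1 to dim(v);
    var2 = vname(v{i});
    if var1 < var2 then do;        /* skip the diagonal and duplicates */
      pcc = v{i};
      strength = put(abs(pcc), pccband.);
      output;
    end;
  end;
run;

proc print data=work.pcc_long noobs;
run;
```

Because the matrix is symmetric, keeping only the pairs where var1 sorts before var2 lists each pair once.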
Some like to see the same information as scatter plots and look for narrow, increasing or decreasing patterns. A narrow band of points that clearly increases or decreases indicates a PCC of larger magnitude.
Here we see the same information; note the same "boxes" as the ones marked red above. Also see how plots involving a binary variable look, like "gender" against "MileMinDur".
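To look more closely at a single pair, a scatter plot with a fitted line can be drawn with Proc Sgplot. A minimal sketch, assuming the same sample library and the variables sprint and MileMinDur from the Proc Corr code above:

```sas
ods graphics on;

* Scatter plot of one variable pair with a linear regression line ;
proc sgplot data=sample.sample_dataset_2014;
  scatter x=sprint y=MileMinDur;
  reg x=sprint y=MileMinDur / nomarkers;
run;
```

The REG statement only adds the fitted line here, since NOMARKERS suppresses a second set of points on top of the scatter plot.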
Some references and what Copilot says about correlation.
So, Proc Corr is a good tool for such analysis. To investigate further and repeat the results and code above, see the links:
Location of sample data and good information about proc corr: Data and info about Proc Corr
Proc Corr documentation: Documentation Proc Corr
Various odd correlations (from "Charts courtesy of Tyler Vigen"): Fun fact correlations
About correlation and causation: Discussion about correlation and causation
Finally, since Artificial Intelligence (AI) is popular: to see what the Edge browser's Copilot can tell you about correlation, open Edge, press the Copilot icon in the top right corner, type the text "explain how statistical correlation works" and read the summary of the topic.