I am dealing with a few dozen variables in the Proc Corr procedure. I used the array function with the colon to specify the variables. The variables have the same prefix, but the number following them is not sequential. It's just a 5 digit number. I need a way to filter out small and insignificant correlation values. Is there a way to do this in the procedure itself?
I've tried outputting the correlation data set so I can filter there, but the WHERE clause in PROC PRINT and PROC SQL does not appear to support the : array operand.
*calculate correlation matrix for the data;
ods output PearsonCorr=Corr_P;
PROC CORR DATA=sashelp.cars;
VAR MSRP Horsepower MPG_City MPG_Highway Weight;
RUN;
*sort for transposes;
proc sort data=Corr_P;
by VARIABLE;
run;
*restructure data so that it's in a long format for graphing;
*need to transpose the correlation and p-values separately;
proc transpose data=Corr_P out=CorrLong1(rename=(COL1=Correlation))
name=CorrelationID;
var MSRP Horsepower MPG_City MPG_Highway Weight;
by Variable;
run;
proc transpose data=Corr_P out=CorrLong2(rename=(COL1=p_value)) name=PvalueID;
var PMSRP PHorsepower PMPG_City PMPG_Highway PWeight;
by Variable;
run;
*merge data sets to get p-value and correlation in right places;
data CorrLong;
merge CorrLong1 CorrLong2(drop=PvalueID );
by Variable;
LABEL CorrelationID="Correlations"; run;
*sort for graphing;
proc sort data=CorrLong;
by VARIABLE CorrelationID;
run;
*create a heat map graph with P-Values in the squares;
proc sgplot data=CorrLong noautolegend;
heatmap x=Variable y=CorrelationID / colorresponse=Correlation name="nope1" discretex discretey x2axis colormodel=ThreeColorRamp; *Colorresponse allows discrete squares for each correlation. x2axis bring the label to the top;
text x=Variable y=CorrelationID text=p_value / textattrs=(size=10pt) x2axis name='nope2'; /*To overlay significance, create a variable that contans that info and set text=VARIABLE */
label correlation='Pearson Correlation';
yaxis reverse display=(nolabel);
x2axis display=(nolabel);
gradlegend;
run;
Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html
For your question you may have to provide more specific details.
First it may help to show the actual code you used. That lets us know all of the options and the likely layout of the output data set(s).
Second what to do you mean by "filter out small"?
Most of the output sets from Proc Corr are in a specific form that likely needs to be transposed in some form to get what you want. You may consider using one of the small SAS supplied data sets, such as SASHELP.CLASS to generate output and then tell us what the rules with that output would be to get what you want.
First, use PROC CORR and the OUTP= option to get the correlations in a data set.
Then convert the correlations from wide to long format.
You can then use the WHERE clause to eliminate the small correlations. If the following isn't exactly what you want, it should point you in the right direction:
data Have(keep=A:);
set sashelp.cars(
rename=(
MSRP=A123
Invoice=A432
EngineSize=A847
Cylinders=A9212
Horsepower=A23
MPG_City=A439
MPG_Highway=A8736)
);
run;
proc corr data=Have outp=CorrOut noprint;
var A:;
run;
proc print data=CorrOut;
run;
data Want;
set CorrOut(rename=(_NAME_=Var1) where=(_TYPE_='CORR'));
length Var2 $32.;
array A[*] A:;
do j = 1 to dim(A);
Var2 = vname(A[j]);
Corr = A[j];
output;
end;
keep Var1 Var2 Corr;
run;
%let cutoff = 0.5;
proc print data=Want;
where abs(Corr) > &cutoff AND Corr^=1;;
run;
Perhaps you would like techniques near the bottom of this post.
Hi @RandoDando
As proposed by @WarrenKuhfeld , graphic solutions can be vey useful to identify most correlated variables.
E.g.:
proc format;
value CorrSignif -0.0-<0.2 = "red"
0.2-<0.4, -0.4-<-0.2 = "orange"
0.4-<0.6, -0.6-<-0.4 = "yellow"
0.6-<0.8, -0.8-<-0.6 = "lightgreen"
0.8-<1.0, -1.0-<-0.8= "forestgreen"
1, -1 = "White";
run;
proc corr data=Have outp=CorrOut (where=(_type_='CORR')) noprint;
run;
proc print data=CorrOut (drop=_type_ rename=(_name_=Variable))
style(column)={backgroundcolor= CorrSignif.} noobs;
run;
Best,
*calculate correlation matrix for the data;
ods output PearsonCorr=Corr_P;
PROC CORR DATA=sashelp.cars;
VAR MSRP Horsepower MPG_City MPG_Highway Weight;
RUN;
*sort for transposes;
proc sort data=Corr_P;
by VARIABLE;
run;
*restructure data so that it's in a long format for graphing;
*need to transpose the correlation and p-values separately;
proc transpose data=Corr_P out=CorrLong1(rename=(COL1=Correlation))
name=CorrelationID;
var MSRP Horsepower MPG_City MPG_Highway Weight;
by Variable;
run;
proc transpose data=Corr_P out=CorrLong2(rename=(COL1=p_value)) name=PvalueID;
var PMSRP PHorsepower PMPG_City PMPG_Highway PWeight;
by Variable;
run;
*merge data sets to get p-value and correlation in right places;
data CorrLong;
merge CorrLong1 CorrLong2(drop=PvalueID );
by Variable;
LABEL CorrelationID="Correlations"; run;
*sort for graphing;
proc sort data=CorrLong;
by VARIABLE CorrelationID;
run;
*create a heat map graph with P-Values in the squares;
proc sgplot data=CorrLong noautolegend;
heatmap x=Variable y=CorrelationID / colorresponse=Correlation name="nope1" discretex discretey x2axis colormodel=ThreeColorRamp; *Colorresponse allows discrete squares for each correlation. x2axis bring the label to the top;
text x=Variable y=CorrelationID text=p_value / textattrs=(size=10pt) x2axis name='nope2'; /*To overlay significance, create a variable that contans that info and set text=VARIABLE */
label correlation='Pearson Correlation';
yaxis reverse display=(nolabel);
x2axis display=(nolabel);
gradlegend;
run;
Thank you @Reeza . I have 49 variables in my correlation matrix, so I had to adjust the size of the graph in the ods settings, as well as the font size for the correlation values. It's big and it's busy, but it's only being used for analytical purposes, and it tells me what I need to know. Much easier than looking through a matrix without any indicators to highlight higher values.
I see this has an accepted solution, but I'll just comment that if all you want is information about the largest correlation coefficients you can use the RANK option in the PROC CORR statement to request the correlation coefficients be displayed in order from highest to lowest absolute value, or the BEST=n option which requests an ordered display of only the n largest correlation coefficients for each variable. In both these cases, each column of the display no longer corresponds to the same variable, but instead is the "Bestn" variable that gives the nth largest correlation coefficient for that row variable.
The CORRELATION procedure is part of SAS Viya. It is included as part of SAS Visual Statistics.
Hi! Thanks for this answer, it was very helpful.
I have a question: I want to use the Pearson values in the lower triangle, and p-values in the upper triangle, how I would do this?
Thanks!
Luan.
Modify the data set before it goes to the graphing procedure.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.