David_M
Obsidian | Level 7

I am performing simple Pearson, Spearman, and Kendall correlations between many variables using PROC CORR.

 

1. I need to suppress non-significant correlations (p > 0.05) in my matrix, leaving only significant correlations (p < 0.05).

2. I need to show confidence intervals for these correlations.

 

How can I do this?

Thanks!

Accepted Solution
StatDave
SAS Super FREQ

Use the FISHER option, which also provides confidence limits for the correlations, save the corresponding table (see this note) using a WHERE clause to filter out the nonsignificant p-values, then print it. For example, using the data set from the Getting Started section of the PROC DISTANCE documentation:

proc corr data=protein fisher;
   var _numeric_;
   /* Keep only rows of the Fisher table with p < 0.05 */
   ods output FisherPearsonCorr=corrs(where=(pValue<.05));
run;
proc print data=corrs noobs;
run;

Also, see this note for how to present a heat map visualization of a correlation matrix that makes significant values stand out.
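
For a quick sketch of that idea (untested; assuming the CORRS data set created above, which contains Var, WithVar, and Corr columns):

/* Untested sketch: plot the significant correlations as a heat map */
proc sgplot data=corrs;
   heatmapparm x=Var y=WithVar colorresponse=Corr;
run;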


19 Replies

David_M
Obsidian | Level 7

Thank you ... how can I adapt the code to handle all numeric and categorical variables? I have continuous, ordinal, binary, and nominal variables that I need to perform the aforementioned correlation tests on to generate the two Spearman and Kendall tau-b correlation matrices. My apologies, but I'm very new to SAS.

 

Thanks.

David_M
Obsidian | Level 7

Ok, with your help, I came up with the following code:

ods select JS_Corr;

/* Perform Spearman's correlation for Job_Satisfaction and Respect at Work
   with potential confounders */
proc corr data=F0_Data fisher spearman;
   var Job_Satis Resp_Work;
   with _all_;
   /* Save the correlation table parameters with p < 0.05 to JS_Corr */
   ods output fisherspearmancorr = JS_Corr(where=(pvalue<.05));
run;

ods pdf file="JobSatis_Resp@Work Confounders.pdf";
title "Job Satisfaction - Respect at Work Potential Confounders";

[Attached screenshot of the output table: "Need to get rid of the X variables"]

It seems to produce the desired output of finding variables that pair with both Job_Satis and Resp_Work and have p < 0.05, except for these two issues:

 

1. How do I get rid of the variables coded as X? They do not meet the requirement, unlike those shown in the green boxes. A variable in the WithVar column has to be statistically significant with both Job_Satis (outcome) and Resp_Work (predictor) to be a confounding variable; otherwise it should not appear in the display table, the JS_Corr data set, or anywhere else.

 

2. How can I save the correct possible confounding variables (ordinals and nominals) in the green box to another data set as distinct, non-duplicated variables? I will be performing some analysis to reduce collinearity among these variables for later ordinal logistic regression modelling. PCA (?) or VIF (polychoric?) procedures are the candidates for now.

@Rick_SAS, @Ksharp, @PaigeMiller

 

Seriously, thank you!!!

 

Ksharp
Super User

1) For your first question:

proc corr data=sashelp.heart(obs=400) fisher(biasadj=no) spearman;
	var weight height;
	with _numeric_;
	ods output fisherspearmancorr = JS_Corr;
run;

proc sql;
create table want as
select *
 from JS_Corr
  group by WithVar
   /* Keep a WITH variable only when it is significant (0 < p < .05;
      the "." excludes missing p-values) with both VAR variables */
   having sum(.<pvalue<.05)=2;
quit;

P.S. If your data set is very large, you could end up with all observations having p-values < 0.05.

 

2) For your second question, you could create a macro variable holding all of these variable names and use it in the subsequent PROC LOGISTIC code:

proc sql noprint;
select distinct WithVar into :vnames separated by ' ' from want ;
quit;

%put &=vnames. ;


proc logistic .........;
model y= &vnames. ;
run;

 

 

P.S. To assess the collinearity between variables, I would use the CORRB option if you are fitting a LOGISTIC model:

proc logistic data=sashelp.heart;
model sex(event='Male')=_numeric_/corrb;
run;

[Attached screenshot: CORRB output, the estimated correlation matrix of the parameter estimates]

 

David_M
Obsidian | Level 7

Thank you, @Ksharp ... I believe I understand your first solution, but not the second one. Would you mind explaining its logic? Thanks again.

Ksharp
Super User

My suggestion for the second one is to use the CORRB option and remove those variables that show collinearity, rather than listing all of these variables in PROC LOGISTIC.

 

proc logistic data=sashelp.heart;
model sex(event='Male')=_numeric_/corrb;
run;
StatDave
SAS Super FREQ

I think the easiest way to do this is to create separate data sets of the significant correlations for each of your two primary variables and then merge them together, keeping the correlations that are on the same WITH variable. However, you are at risk of the common problem of multiple testing and its effect on your overall error rate. Because you are testing many correlations, presumably each at the 0.05 level, the error rate over all of these tests is much higher than 0.05. This can be handled by using a p-value adjustment method, and you can use PROC MULTTEST to do this. There are many adjustment methods available, but one that is generally acceptable is Holm's stepdown Bonferroni method, which is much less conservative than the simple, well-known Bonferroni method. The entire process is illustrated in the example below using the wine cultivar data that can be found in the Getting Started example of the PROC HPSPLIT documentation in the SAS/STAT User's Guide. In the code below, Alcohol and Malic are treated like your two primary variables. Correlations with adjusted p-values less than 0.05 are kept.

 

/* Spearman correlations of the two primary variables with all numeric variables */
proc corr data=wine fisher spearman;
   var alcohol malic;
   with _numeric_;
   ods output fisherspearmancorr = corr;
run;

/* Adjust the p-values with Holm's stepdown Bonferroni method */
proc multtest inpvalues(pvalue)=corr holm;
   ods output pvalues=adjp;
run;

/* Separate data sets of significant (adjusted p < .05) correlations
   for each primary variable */
data alc(where=(var='Alcohol' and stepdownbonferroni<.05))
     mal(where=(var='Malic' and stepdownbonferroni<.05));
   merge corr adjp;
run;

proc sort data=alc; by withvar; run;
proc sort data=mal; by withvar; run;

/* Keep only the WITH variables that are significant with both */
data done;
   merge alc(rename=(var=var1 corr=corralc raw=pvalalc stepdownbonferroni=adjpalc))
         mal(rename=(var=var2 corr=corrmal raw=pvalmal stepdownbonferroni=adjpmal));
   by withvar;
   drop zval--ucl;
   if var1 ne ' ' and var2 ne ' ' then output;
run;

proc print;
   id withvar;
   var var: corr: adj:;
run;
David_M
Obsidian | Level 7

Thank you for this ... I'll try to adapt it to my code. Actually, I tried a similar method first, but got stuck after the merging on how to extract the data I needed, and I gave up on it after two days of frustration. Thanks ... will give an update soon.

Rick_SAS
SAS Super FREQ

>I am performing  simple Pearson, Spearman and Kendall correlations between many variables using PROC CORR.

How many variables do you have? For N variables, there are N*(N-1)/2 pairwise correlations, many of which will be significant. For example, if you have N=100 variables, that is 100*99/2 = 4,950 pairwise correlations, and most of them could turn out to be significant. On the other hand, if you have N=10 or N=15, then you can visualize the correlations.
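
For the small-N case, a quick sketch (untested) using the PLOTS= option on a handful of numeric variables from SASHELP.CARS:

/* Untested sketch: scatter plot matrix for a few numeric variables */
proc corr data=sashelp.cars plots=matrix(histogram);
   var mpg_city mpg_highway horsepower weight;
run;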

 

So, how many variables and what do you want to do with the information about correlations? 

David_M
Obsidian | Level 7

Thanks @Rick_SAS. You guessed it. I have a total of 100 variables (of all types), most of which will not be significant, hence the need to sort them out by their p-values. I do not want to produce a 100x100 matrix table and then manually sort out the statistically significant variables in Excel. I did that before on a 50x50 matrix from another dataset generated in SPSS (of which < 50 were significant), though I had hoped SAS would be more streamlined about this.

Rick_SAS
SAS Super FREQ

I think the task you propose is hard, and it's not clear what information it would provide. Again, what do you intend to do with these thousands of numbers? What is the scientific result you want to find? The reason I ask is that there might be a better way to conduct your analysis. For example, if your goal is to reduce the number of variables in a model by eliminating highly correlated variables, there are better ways to implement that task.
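
For instance (a sketch, untested, with a hypothetical data set and variable names), PROC VARCLUS clusters the variables and suggests one representative per cluster:

/* Untested sketch: mydata and x1-x100 are hypothetical names.
   SHORT suppresses most printed output; within each cluster, the
   variable with the lowest 1-R**2 ratio is a good representative. */
proc varclus data=mydata maxeigen=0.7 short;
   var x1-x100;
run;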

 

Another question: You say, "I have continuous, ordinal, binary and nominal variables."  

The statistic you need to use depends on the type of the variable. Is there some naming convention that enables you to determine the type of the variable? For example, are the variables named CONT1-CONT20, ORD1-ORD25, and NOM1-NOM10? 

 

In general, you would use a different measure of association for the various combinations of variables:

  1. Cont-Cont : usually Pearson correlation
  2. Cont-Ord : recode Ord as 1,2,...,k and use Kendall tau-b
  3. Cont-Nom : this is tough; you have to recode the levels and use something called the point-biserial correlation. I've never done this.
  4. Ord-Ord : Kendall tau-b
  5. Ord-Nom : rank-biserial correlation. I've never done this.
  6. Nom-Nom : PROC FREQ provides several statistics of association
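
For instance, items 4 and 6 might look like this (a sketch, untested, with hypothetical variables ord1, ord2, nom1, and nom2 in mydata):

/* 4. Ord-Ord: Kendall tau-b for two ordinal variables coded 1..k */
proc corr data=mydata kendall;
   var ord1 ord2;
run;
/* 6. Nom-Nom: the CHISQ option requests Cramer's V and related statistics */
proc freq data=mydata;
   tables nom1*nom2 / chisq;
run;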

I do not think many people try to compute these statistics all at once, because you cannot easily compare one method to another. For example, a Kendall tau-b of 0.5 does not equate to a Pearson correlation of 0.5, and I would not know how to compare the various statistics to each other.

 

David_M
Obsidian | Level 7

Thanks ... my actual goal is to identify confounder variables in my dataset governing the relationship between Job Satisfaction (Y outcome, ordinal, Likert 1-5) and WorkRespect (X predictor, ordinal, Likert 1-4). The characteristics of a confounder C are:

1) Y and C are statistically related (p < 0.05)

2) X and C are statistically related (p < 0.05)

3) C cannot be on the path between X and Y, otherwise it becomes a mediator.

 

C (my dataset) is mostly ordinal and nominal, but binary and continuous variables exist.

 

I am trying to identify variables in my dataset that meet the first two criteria, hence the original goal of this post. I used SPSS, as I said earlier, on a smaller set, then ported the files over into Excel and did some painful Excel math logic, which I wanted to avoid on this much larger dataset. BTW, I compared the Spearman and Kendall results for the first two criteria and they were within 10% of each other. Both gave the same number of variables that meet criteria #1 and #2.

 

How to verify the 3rd criterion is currently a mystery to me, beyond drawing Directed Acyclic Graphs (DAGs) and using "strategic common sense," as a colleague told me, to determine whether C truly comes before X or simply does not belong. Literature searches so far have not been that helpful.
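
In case it helps anyone later, here is a sketch of the kind of DAG check I have in mind (untested; PROC CAUSALGRAPH, which I have not used before, with hypothetical confounders C1 and C2):

/* Untested sketch: encode the assumed DAG and ask SAS for valid
   adjustment sets for the Resp_Work -> Job_Satis effect */
proc causalgraph;
   model "JobSatisDAG"
      Resp_Work ==> Job_Satis,
      C1 ==> Resp_Work Job_Satis,
      C2 ==> Resp_Work Job_Satis;
   identify Resp_Work ==> Job_Satis;
run;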

 

Anyway, to prevent bloating this thread any further, I wanted to concentrate on SAS code that helps meet the first two criteria. Hope this makes sense, and my apologies for any confusion on my part.
