hpatel3
Obsidian | Level 7

Hello! 

 

I have a project at work where I'm being asked to analyze 7 variables (5 of them categorical (Yes/No) and 2 numerical) and their correlation with disease status (variable ADV_HF): 1 = they have the disease / 0 or blank = they don't have it.

 

I have not used multivariate analysis before, and the different types are a little overwhelming.

 

Based on the SAS forums, I am under the impression that I shouldn't use Proc Reg since I have categorical variables, so should I use Proc GLM or Proc Corr? Will it make a huge difference? 

 

What I've got so far is: 

 

proc glm data=Hetal.ES_Regression;
class Adv_HF;
model age_diag fam_Hx_ES fam_Hx_SD hx_sync LBBB EF_Reg AF_prior ;

 

 

1. I'm not sure how, or whether, to use the CONTRAST and MANOVA statements.

2. How do I specify in the CLASS statement that 1 = disease state and 0 = without disease?

3. Am I missing any other key data step in this analysis?

 

Thank you!

PaigeMiller
Diamond | Level 26

I have a project at work where I'm being asked to analyze 7 variables (5 of them categorical (Yes/No) and 2 numerical) and their correlation with disease status (variable ADV_HF): 1 = they have the disease / 0 or blank = they don't have it.

 

If the specific request you have is simply to analyze correlations, then you would use PROC CORR.

 

If the underlying reason is to fit a model, you should use PROC LOGISTIC (which is appropriate when your response variable is binary).

 

This is not (at least the way I use the word) a "multivariate" analysis, and no MANOVA would work here anyway. Multivariate would imply to me that you have multiple response variables, and if the multiple response variables are continuous, that is the only time when any MANOVA would work. So none of this applies to your situation, as I understand it.

--
Paige Miller
hpatel3
Obsidian | Level 7

Hi @PaigeMiller , thanks for responding. 

 

I decided to use Proc Corr with the following code: 

regression.JPG

and it keeps giving me this error  in my log:

 

error reg log.JPG

 

 

 

Why does it keep telling me that my variables do not match the type prescribed for this list? What am I doing wrong here?

Do I need to denote which are categorical? Also, do I need to use the BY statement to specify that I want these variables compared between those who have the disease (1) and those who don't (0)?

Reeza
Super User
A really good tutorial is this one:
https://stats.idre.ucla.edu/sas/dae/logit-regression/

Note the use of the PARAM=REF option on the CLASS statement. You will want to do that. Additionally, check this example out:

https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_logistic_examples02.htm&docsetVer...
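
To make the PARAM=REF suggestion concrete, here is a minimal sketch of what such a PROC LOGISTIC call might look like. The dataset and variable names are copied from the original post. Treating the five Yes/No risk factors as CLASS variables and age_diag and EF_Reg as the two numeric ones is an assumption, not something confirmed in the thread.

```sas
/* Sketch only: names come from the original post.                        */
/* Assumption: age_diag and EF_Reg are the two numeric variables and the  */
/* other five are the Yes/No categorical ones.                            */
proc logistic data=Hetal.ES_Regression;
   /* Only categorical predictors belong in CLASS;          */
   /* PARAM=REF requests reference-cell (dummy) coding.     */
   class fam_Hx_ES fam_Hx_SD hx_sync LBBB AF_prior / param=ref;
   /* EVENT='1' models the probability that Adv_HF equals 1 */
   model Adv_HF(event='1') = age_diag EF_Reg
                             fam_Hx_ES fam_Hx_SD hx_sync LBBB AF_prior;
run;
```

Note that PROC LOGISTIC drops observations with a missing response, so if a blank Adv_HF really means "no disease", those blanks would need to be recoded to 0 first.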

hpatel3
Obsidian | Level 7

Thanks for replying @Reeza!

 

I don't think Proc logistic works in this case because we're not looking for a specific question/outcome. We are merely seeing whether there is a correlation between the variables and whether they have or don't have the disease.

 

My variables are potential risk factors, and we want to see if there's any correlation between these and disease status.

 

So I think I'd use proc Corr, yes?

Reeza
Super User

Nope, you have a binary outcome variable so using PROC CORR is not suitable here. You really are looking for logistic regression here and the odds ratios, whether you do it one variable at a time or a full model. 

 



Reeza
Super User
And your issue with CORR is that it needs numeric variables so you should fix your data first and ensure that numeric variables are numbers. Categorical variables can be either numeric or categorical.
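
As a rough illustration of "fixing the data first", the sketch below checks the variable types and then builds numeric 0/1 copies of the Yes/No variables. It assumes those five variables are stored as character strings containing 'Yes' or 'No', which the thread does not confirm. The output dataset name work.es_numeric and the _n variable names are hypothetical.

```sas
/* Step 1: see which variables are character and which are numeric */
proc contents data=Hetal.ES_Regression;
run;

/* Step 2 (sketch): assumes the five risk factors are character     */
/* 'Yes'/'No' values. work.es_numeric and the *_n names are made up */
/* for illustration.                                                */
data work.es_numeric;
   set Hetal.ES_Regression;
   array cvars {5} $ fam_Hx_ES fam_Hx_SD hx_sync LBBB AF_prior;
   array nvars {5} fam_Hx_ES_n fam_Hx_SD_n hx_sync_n LBBB_n AF_prior_n;
   do i = 1 to 5;
      if upcase(strip(cvars{i})) = 'YES' then nvars{i} = 1;
      else if upcase(strip(cvars{i})) = 'NO' then nvars{i} = 0;
      else nvars{i} = .;   /* any other value is left missing */
   end;
   drop i;
run;
```

If PROC CONTENTS shows the variables are already numeric, the DATA step is unnecessary (and the `$` array would raise an error).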
hpatel3
Obsidian | Level 7

It worked! I got results! (I think)

 

My only question left is about status. Does the analysis need to be told that event status=1 means developing the disease and event status=0 means not developing it? Or does SAS automatically assume 0=no event and 1=event? My code and results are as follows:

 

 

final regre code.JPG
results_regre.JPG

 

Reeza
Super User

For a univariate analysis with a single binary outcome, I would recommend ANOVA/t-test for the continuous variables and chi-square for categorical variables.
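
A minimal sketch of that univariate approach, using the names from the original post. It assumes age_diag and EF_Reg are the continuous variables and that Adv_HF has exactly two non-missing levels.

```sas
/* Sketch only: compare the mean of each continuous variable */
/* between the two Adv_HF groups.                            */
proc ttest data=Hetal.ES_Regression;
   class Adv_HF;           /* needs exactly two non-missing levels */
   var age_diag EF_Reg;
run;

/* Chi-square test of each Yes/No variable against disease status */
proc freq data=Hetal.ES_Regression;
   tables (fam_Hx_ES fam_Hx_SD hx_sync LBBB AF_prior) * Adv_HF / chisq;
run;
```

Both procedures exclude observations where Adv_HF is missing, so blanks that are meant to be "no disease" would have to be recoded to 0 beforehand.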

 

For your output, you have Pearson correlation coefficients, and SAS makes no assumptions about 0/1 values representing a particular record type. You probably want Tau-b or Tau-c instead.

https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
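
For reference, a sketch of how those statistics could be requested. The KENDALL option in PROC CORR reports Kendall's tau-b, and the MEASURES option in PROC FREQ reports both tau-b and Stuart's tau-c. The names are from the original post, and the PROC CORR step assumes every listed variable is already numeric.

```sas
/* Sketch only: PROC CORR requires numeric variables. */
proc corr data=Hetal.ES_Regression pearson kendall;
   var age_diag EF_Reg fam_Hx_ES fam_Hx_SD hx_sync LBBB AF_prior;
   with Adv_HF;            /* correlate each VAR variable with Adv_HF */
run;

/* PROC FREQ works with character or numeric variables;     */
/* MEASURES includes Kendall's tau-b and Stuart's tau-c.    */
proc freq data=Hetal.ES_Regression;
   tables LBBB * Adv_HF / measures;
run;
```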

 

 

 

 

PaigeMiller
Diamond | Level 26

PROC CORR can produce a correlation-like number called Kendall's Tau for categorical variables. However, they probably do need to be converted to category numbers (a numeric variable) in order for PROC CORR to process them.

--
Paige Miller
hpatel3
Obsidian | Level 7

@PaigeMiller, I converted to categorical variables and I do see Kendall's tau, but I figured Pearson's would be the number to look at. Why do you suggest tau instead?

Should I be using Pearson's on the continuous variables and tau on the categorical ones?

PaigeMiller
Diamond | Level 26

Yes, as @Reeza pointed out, and I omitted, you want Kendall's Tau-B (or maybe it's a lowercase b) which is appropriate for computing a correlation-like measure when you have two ordinal variables. You use Pearson when you have two continuous variables.

 

I still have issues with your requirement that you want a measure like correlation when you have predictor (X) variables and response (Y) variables. Correlation is not meant for that case; some measure of how well the X variables predict the Y is the appropriate statistic.

 

I can't agree with this statement from Reeza

For a univariate analysis with a single binary outcome, I would recommend ANOVA/t-test for the continuous variables

 

This isn't correlation, which seems to be what you are asking for, although I don't understand why; and it also seems to reverse the role of X and Y. You don't do ANOVA or t-tests with binary Y, you do it for binary X. For continuous X variables, and binary Y, logistic regression is still what I would use, and the measure you want is the odds ratio or the slope of the logistic regression.
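
A one-predictor version of that idea might look like the sketch below, with names taken from the original post. The 10-unit change on the UNITS statement is an arbitrary illustrative choice.

```sas
/* Sketch only: one continuous predictor, binary response. */
proc logistic data=Hetal.ES_Regression;
   /* EVENT='1' makes Adv_HF = 1 the modeled event */
   model Adv_HF(event='1') = age_diag;
   /* Optional: also report the odds ratio for a 10-unit change */
   units age_diag = 10;
run;
```

Without EVENT= (or the DESCENDING option), PROC LOGISTIC models the lower ordered response value, which here would be 0, so the odds ratios would come out inverted relative to what is usually wanted.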

--
Paige Miller
Reeza
Super User
I would treat the 0/1 outcome as groups and check for a difference between the groups. So, is the mean of the continuous variables different for those who have disease X and those who don’t? If that’s not appropriate for a t-test/ANOVA, what test would you use?
PaigeMiller
Diamond | Level 26

As I stated, you seem to have reversed X and Y. Logistic regression is what you use with binary Y.

 

But, the whole issue remains unclear as to what the original poster really wants to achieve, and so I think until he clarifies the situation, I'm going to pause here. On the one hand he wants correlation but on the other hand he was talking about fitting a model with PROC REG or PROC GLM.

--
Paige Miller
hpatel3
Obsidian | Level 7

@PaigeMiller @Reeza

 

After going back and reviewing, I don't think Proc Reg or Proc Corr would give me the best results. I stated originally that I want the correlation between variables, and I still do. I thought Proc Corr/GLM/REG would give me that, which is why I was trying to use them. But someone on another thread stated that Proc Logistic would work for both continuous and categorical variables, so if I can just use that, I believe that would work? I don't believe I need to fit the model or anything.

 

So TO SEE IF THERE IS A CORRELATION between my risk factors and whether they develop the disease, I tried this:

 

libname Hetal "\\tuftsmc\home\hpatel3\SAS Datasets";
run; 



proc logistic data=hetal.es_regre;
class age_diag EF_reg Hx_Sync FHX_ES FHX_SD LBBB AF_prior * Hx_Sync FHX_ES FHX_SD LBBB AF_prior;
model adv_hf (event="1")= age_diag EF_REG Hx_sync FHX_ES FHX_SD LBBB AF_prior;
run;

 

 

 

However got this error:

 

SAS error.PNG

 

 

 

 

I'm not sure exactly what should be done based on this error. Is it asking me to specify which variables are numeric and which are character? I'm not sure how to write this out in my code.

 

My class statement is: all variables * categorical variables <-- That is what I was supposed to do, correct?

 

And my model statement is: disease status variable = all variables  <--is this correct as well?

 

Thanks!
