11-18-2012 11:58 AM
I am new to logistic and GLM procedures, and therefore I have some syntactical and conceptual questions:
I have a dataset (attached to this post) that contains the salary and various other important characteristics of all faculty (n=52) in a college. The descriptions of the variables are as follows:
OBS: observation #
SX: sex (0=Male, 1=Female)
RK: rank (1=Assistant Professor, 2=Associate Professor, 3=Full Professor)
YR: # years in current rank
DG: highest degree (0=Masters, 1=Doctorate)
YD: # years since highest degree earned
SL: academic year salary ($)
I need to determine if gender is associated with rank, highest degree, number of years in current rank, number of years since highest degree earned, and academic year salary.
Since gender is a binary outcome, I have used logistic regression to address the question. However, I am getting a result where all my predictors appear highly significant, which does not look correct. Am I approaching this question correctly, or is my syntax wrong? Should I be using GLM?
My code is as follows:
proc logistic data=discrimination;
   class rk dg;
   model sx(descending) = rk yr dg yd sl;
run;
Another question that I am addressing is:
2. Is there a significant relationship between rank and academic year salary?
I am using a simple regression model. Here I have assigned rank as X (categorical) and salary as Y (continuous). Am I doing this correctly?
Below is the code:
proc reg data=discrimination simple;
   model sl = rk;
run;
Thanks in advance for your suggestions!
11-18-2012 09:42 PM
Actually, I am also a rookie in statistical theory, but I don't understand why you want to use yd and yr as FREQ variables. That can't be right. Your PROC LOGISTIC code also doesn't look right; did you check it against the documentation?
PROC REG treats every predictor as continuous (it has no CLASS statement), so I don't think it is a good idea for a categorical variable like rk; you should try PROC GLM instead.
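To illustrate the PROC GLM suggestion, here is a minimal sketch. It assumes the dataset is named discrimination, as in the original post, and the LSMEANS line is an optional addition for pairwise comparisons, not something taken from the thread.

```sas
/* One-way model: salary (sl) by rank (rk), with rk treated as categorical */
proc glm data=discrimination;
   class rk;                            /* rk becomes a 3-level factor      */
   model sl = rk;                       /* overall F test for rank effect   */
   lsmeans rk / pdiff adjust=tukey;     /* optional: pairwise rank contrasts */
run;
quit;
```

The CLASS statement is the key difference from PROC REG: without it, rk would enter the model as a single numeric slope, which assumes the salary step from rank 1 to 2 equals the step from 2 to 3.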
11-19-2012 07:37 AM
While this method may work (in the sense that you get a solution), I think you might have reversed the roles of the independent and dependent variables, based on your statement "I need to determine if gender is associated with rank, etc.". I would think that you just want to know whether the average rank, number of years, etc. differ for males and females. Thus, for the ordinal responses (rank and highest degree), PROC FREQ would probably be the most straightforward analysis. For the interval responses (YR, YD, SL) as the dependent variable, I would start with PROC GLM, but pay particular attention to the distribution of the residuals. If the residuals deviate a lot from normality (and I would use QQ plots to determine this rather than normality tests), I would move to a procedure that can model the distribution, such as PROC GENMOD or PROC GLIMMIX.
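The approach described above could be sketched roughly as follows. This assumes the dataset name discrimination from the original post; the FISHER option and the PLOTS=DIAGNOSTICS request are my own choices for a small sample and for getting a residual QQ plot, not something specified in the thread.

```sas
/* Categorical responses: cross-tabulate sex against rank and degree */
proc freq data=discrimination;
   tables sx*rk sx*dg / chisq fisher;   /* exact test is an assumption for n=52 */
run;

/* Interval responses: compare YR, YD and SL between the sexes */
ods graphics on;
proc glm data=discrimination plots=diagnostics;  /* panel includes a residual QQ plot */
   class sx;
   model yr yd sl = sx;                 /* one univariate analysis per response */
run;
quit;
```

If the QQ plots show strong departures from normality, the same model statement can be carried over to PROC GENMOD or PROC GLIMMIX with a different distribution.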
11-19-2012 09:21 AM
In addition to Steve's comments, I would also caution that your sample size is extremely small (N=52) so you are unlikely to be able to do more than univariable analyses.
[The reason that everything was significant in your initial PROC LOGISTIC is the FREQ statements. The FREQ statement treats those variables as observation multipliers, so your effective sample size became many thousands instead of 52.]
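For comparison, here is a minimal sketch of a univariable logistic model with no FREQ statement, so each of the 52 rows counts once. The choice of sl as the single predictor and the UNITS statement are assumptions for illustration only.

```sas
/* Univariable logistic model: no FREQ statement, so n stays at 52 */
proc logistic data=discrimination;
   model sx(event='1') = sl;   /* models the probability that sx = 1 (female) */
   units sl = 1000;            /* report the odds ratio per $1000 of salary   */
run;
```

The "Number of Observations Used" line in the output is a quick check that the effective sample size is what you expect.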