I am working with risk and protective factor data for substance use outcomes. My data are arranged so that I have a binary outcome (0 = no use, 1 = use), the total number of risk factors, and the total number of protective factors. Risk factors are known to increase the likelihood of an outcome occurring, and protective factors are known to have the opposite effect. Examination in PROC FREQ shows that the proportion of observations reporting substance use increases with the number of risk factors and decreases with the number of protective factors. When I fit a model with PROC LOGISTIC, though, I get a positive effect for my protective factors. Here is my code:
PROC LOGISTIC DATA=survey DESCENDING;
MODEL sub1= rfs pfs;
RUN;
sub1: binary variable where 1= using the substance and 0=not using the substance.
rfs: total number of risk factors.
pfs: total number of protective factors.
My results for one substance give me the model logit(P(sub1=1)) = -3.1860 + 0.3033(rfs) + 0.1181(pfs). As a researcher I know that this is wrong: I don't have anomalous data where the population is more likely to use substances if they have more protective factors. But I am having trouble figuring out how to correct this.
Here is my log:
306 PROC LOGISTIC DATA=survey DESCENDING;
307 MODEL sub1=rfs pfs;
308 RUN;
NOTE: PROC LOGISTIC is modeling the probability that sub1=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 14445 observations read from the data set WORK.SURVEY.
NOTE: PROCEDURE LOGISTIC used (Total process time):
real time 0.25 seconds
cpu time 0.15 seconds
rfs and pfs are non-negative integers counting the risk or protective factors present (0 means none). rfs ranges from 0 to 21 and pfs ranges from 0 to 12.
One reason this can occur is if your two x-variables rfs and pfs are highly correlated with each other. Another reason this can occur is if you have outliers or clusters in rfs and/or pfs.
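One way to see whether correlation between the predictors is behind the sign change is to fit each predictor on its own and compare with the joint model. The sketch below reuses the data set and variable names from the original post. It assumes nothing beyond those, and it is only a diagnostic, not a fix.

```sas
/* pfs alone: the marginal association, comparable to the PROC FREQ pattern */
PROC LOGISTIC DATA=survey;
  MODEL sub1(event='1') = pfs;
RUN;

/* rfs alone */
PROC LOGISTIC DATA=survey;
  MODEL sub1(event='1') = rfs;
RUN;

/* Both together: the pfs coefficient is now the effect of pfs
   with rfs held fixed, which can differ in sign from the marginal one */
PROC LOGISTIC DATA=survey;
  MODEL sub1(event='1') = rfs pfs;
RUN;
```

If the pfs coefficient is negative when pfs is the only predictor and turns positive once rfs is added, the positive sign reflects the adjusted (conditional) relationship rather than a coding error.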
I just tested for these possibilities. A chi-square test of independence rejects the hypothesis that rfs and pfs are independent, and when I regress each on the other in PROC AUTOREG (rfs=pfs and pfs=rfs), the two are negatively related:
chi-sq: 4876.6114, p<0.0001.
rfs=12.9714-0.7614(pfs), p<0.0001.
pfs=8.7403-0.2989(rfs), p<0.0001.
The overall distributions are almost textbook normal. When stratified by whether or not the observation reported substance use, the distribution of rfs for non-users takes on a right-tail skew. All other distributions remain normal.
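The two slopes reported above already imply a correlation. For ordinary least squares, the slope of rfs on pfs times the slope of pfs on rfs equals the squared Pearson correlation. The sketch below works that out. It assumes both PROC AUTOREG fits were plain OLS (no NLAG= option), which the post does not state. The variable names in the step are made up for illustration.

```sas
/* Sketch: Pearson correlation implied by the two reported slopes.
   Assumption: both regressions were ordinary least squares. */
DATA _NULL_;
  b_rfs_on_pfs = -0.7614;   /* from rfs = 12.9714 - 0.7614(pfs) */
  b_pfs_on_rfs = -0.2989;   /* from pfs = 8.7403 - 0.2989(rfs)  */
  r_squared = b_rfs_on_pfs * b_pfs_on_rfs;     /* product of slopes = r**2 */
  r = sign(b_rfs_on_pfs) * sqrt(r_squared);    /* r takes the sign of the slopes */
  vif = 1 / (1 - r_squared);                   /* variance inflation, two predictors */
  PUT r_squared= 6.4 r= 7.4 vif= 6.4;
RUN;
```

Under that assumption the implied correlation is roughly -0.48 and the variance inflation factor is about 1.3. That is a moderate negative correlation, and a PROC CORR run on the data would confirm or correct it.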
What is the correlation (not the auto-correlation from PROC AUTOREG but the correlation from PROC CORR) between rfs and pfs? Distribution of your x-variables is irrelevant here. Are there outliers or clusters among your x-variables?
My PROC CORR results are as follows:
There are no high or low outliers for either variable.
Change which response level the model treats as the event, and you will get a different result:
PROC LOGISTIC DATA=survey;
MODEL sub1(event='0') = rfs pfs;
RUN;
vs.
PROC LOGISTIC DATA=survey;
MODEL sub1(event='1') = rfs pfs;
RUN;
I tried this before coming here. What it does is switch which of the two has the larger positive coefficient, but both remain positive. My office is renewing my license today, so I can't currently give you the exact coefficients, but the model becomes sub1 = y + pfs + rfs, where the coefficient of pfs > the coefficient of rfs and both coefficients lie between 0 and 1.
Did you check the standard errors of these two coefficients?
The standard errors are as follows:
rfs: 0.0417
pfs: 0.0261
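To judge whether either sign could be noise, each estimate can be compared with its standard error. The sketch below computes Wald z statistics and two-sided p-values. It assumes the standard errors above belong to the estimates quoted earlier in the thread (0.3033 for rfs and 0.1181 for pfs), which the post does not say explicitly.

```sas
/* Sketch: Wald tests from the reported estimates and standard errors.
   Assumption: the SEs pair with the coefficients quoted earlier in the thread. */
DATA _NULL_;
  LENGTH term $3;
  INPUT term $ estimate se;
  z = estimate / se;                  /* Wald z statistic */
  wald_chisq = z**2;                  /* chi-square form printed by PROC LOGISTIC */
  p = 2 * (1 - PROBNORM(ABS(z)));     /* two-sided normal p-value */
  PUT term= z= 6.2 wald_chisq= 7.2 p= PVALUE6.4;
  DATALINES;
rfs 0.3033 0.0417
pfs 0.1181 0.0261
;
RUN;
```

Under that pairing, z is about 7.3 for rfs and about 4.5 for pfs, so both coefficients are well away from zero and the positive pfs sign is unlikely to be sampling noise.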
What is the correlation (not the auto-correlation from PROC AUTOREG but the correlation from PROC CORR) between rfs and pfs?
Are there outliers or clusters among your x-variables?
PROC CORR is new to me, but looking at the guide on that one it seems pretty straightforward. I ran:
PROC CORR DATA=survey;
VAR rfs pfs;
RUN;
My results are: