Another possible reason is BAD data.
Check the standard errors of these two positive coefficient estimates and see whether they are very large.
The standard errors on the coefficients in the model are 0.00563 (rfs) and 0.00743 (pfs).
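Some additional exploration that may help us figure this out:
I decided to run four separate models:
- Substance use predicted by risk factors only
- Substance use predicted by protective factors only
- No substance use predicted by risk factors only
- No substance use predicted by protective factors only
The results are:
- Substance use 'yes'=-2.1091+0.2623(rfs)
- Substance use 'yes'=0.5846-0.0937(pfs)
- Substance use 'no'=2.1091-0.2623(rfs)
- Substance use 'no'=-0.5846+0.0937(pfs)
So the second pair of models was probably unnecessary, since they are just the sign-flipped versions of the first two. All p-values are < 0.0001. The standard error is 0.00542 on pfs and 0.00478 on rfs in either version.
This is to show that, when tested independently, these behave as expected: an increase in the number of risk factors increases the probability of using substances, while an increase in protective factors reduces it. The problem arises when they are put into a model together.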
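For readers following along, here is a minimal sketch of how the 'yes' and 'no' versions of a single-predictor model relate. The original code is not shown in the thread, so this assumes a hypothetical data set named `survey` that holds the thread's variables `sub1` and `rfs`, with `sub1` coded 0/1. The `event=` response option in PROC LOGISTIC selects which outcome is modeled.

```sas
/* Assumptions: data set name "survey" is hypothetical; sub1 is coded 0/1 */

/* Model the probability that sub1 = 1 (substance use 'yes') */
proc logistic data=survey;
  model sub1(event='1') = rfs;
run;

/* Model the probability that sub1 = 0 (substance use 'no') */
proc logistic data=survey;
  model sub1(event='0') = rfs;
run;
```

Switching the modeled event only flips the sign of the intercept and slope and leaves the standard errors unchanged, which is consistent with the remark that the second pair of models is just the inverse of the first.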
Please show a PROC FREQ of how the variables interact.
As you showed before, the PROC FREQ returns a p-value of <0.0001, so it seems that they are NOT independent.
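A cross-tabulation along these lines could be requested roughly as follows. This is only a sketch: the data set name `survey` is hypothetical, and the options the poster actually used are not shown in the thread.

```sas
/* Assumption: data set name "survey" is hypothetical */
proc freq data=survey;
  /* Counts only, plus chi-square tests of independence */
  tables rfs*pfs / chisq norow nocol nopercent;
run;
```

Suppressing the row, column, and overall percentages keeps a 13x22 table readable. Note that a very small chi-square p-value says the two counts are associated but does not by itself say how strong the association is.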
Have you tried adding an interaction term?
Sorry I saw the request and was having trouble getting it all to show up (13x22 table problems). Hopefully this helps.
I am not familiar with adding interaction terms. Is this just running the model as sub1=rfs pfs rfs*pfs?
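In PROC LOGISTIC that specification would look roughly like the sketch below. The data set name `survey` is a placeholder (it is not given in the thread), and `sub1` is assumed to be coded 0/1 with 1 meaning substance use.

```sas
/* Assumptions: data set name "survey" is hypothetical; sub1 = 1 means use */
proc logistic data=survey;
  /* Main effects plus their product term; "rfs|pfs" is an equivalent shorthand */
  model sub1(event='1') = rfs pfs rfs*pfs;
run;
```

Once the product term is in the model, the coefficient on `pfs` is the effect of protective factors when `rfs` is 0, and the coefficient on `rfs` is the effect of risk factors when `pfs` is 0, so the main-effect signs need to be read together with the interaction.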
(edit: I cropped the images to allow them to be easier to read)
Alright, adding only the interaction term, we get:
sub1=-2.8998+0.2695(rfs)+0.0755(pfs)+0.00576(rfs*pfs)
I tried putting rfs and pfs as classes and it brought up coefficients for each level of risk and protective factors (and with the rfs*pfs it then did a coefficient for each level of that). It was very messy.
With 18+ as a final category we get:
sub1=-3.1942+0.3049(rfs)+0.1179(pfs)
OR (with effects)
sub1=-2.9301+0.2738(rfs)+0.0787(pfs)+0.00529(rfs*pfs)
One idea that I am considering: could this be an effect of having the data by observation? Should I instead set it up so that each row holds one combination of substance use = a (0 or 1), risk factors = b (0-21), and protective factors = c (0-12), plus the frequency of responses with that combination? For example, if 12 people said they use the substance and also reported 3 risk factors and 5 protective factors, the row would read (sub) 1, (rfs) 3, (pfs) 5, (n) 12. Right now each of those 12 people is a separate observation in the data set.
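The aggregated layout described above can be sketched as follows, assuming a hypothetical observation-level data set named `survey` (the real name is not given in the thread). PROC FREQ collapses the data to one row per combination, and the FREQ statement in PROC LOGISTIC then treats each row as if it occurred `count` times.

```sas
/* Assumption: data set names "survey" and "counts" are hypothetical */

/* One row per (sub1, rfs, pfs) combination; COUNT holds the frequency */
proc freq data=survey noprint;
  tables sub1*rfs*pfs / out=counts(drop=percent);
run;

/* Fit the same model on the aggregated rows, weighted by frequency */
proc logistic data=counts;
  freq count;
  model sub1(event='1') = rfs pfs;
run;
```

Because the FREQ statement reproduces the observation-level likelihood, the estimates and standard errors should match the fit on individual rows, so this restructuring on its own would not be expected to change the sign of a coefficient.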
@halkyos wrote:
Some additional exploration that may help us figure this out:
I decided to run four separate models:
- Substance use predicted by risk factors only
- Substance use predicted by protective factors only
- No substance use predicted by risk factors only
- No substance use predicted by protective factors only
The results are:
- Substance use 'yes'=-2.1091+0.2623(rfs)
- Substance use 'yes'=0.5846-0.0937(pfs)
- Substance use 'no'=2.1091-0.2623(rfs)
- Substance use 'no'=-0.5846+0.0937(pfs)
So the second pair of models was probably unnecessary, since they are just the sign-flipped versions of the first two. All p-values are < 0.0001. The standard error is 0.00542 on pfs and 0.00478 on rfs in either version.
This is to show that, when tested independently, these behave as expected: an increase in the number of risk factors increases the probability of using substances, while an increase in protective factors reduces it. The problem arises when they are put into a model together.
When your x-variables are correlated with one another, this can sometimes cause the "wrong" sign to appear on one or more of your regression coefficients. What about outliers and clusters?
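One way to gauge how strongly the two predictors are correlated is sketched below. The data set name `survey` is hypothetical. PROC REG is used here only for its collinearity diagnostics, which depend on the predictors alone; its coefficients for a 0/1 outcome are not the logistic estimates and should be ignored.

```sas
/* Assumption: data set name "survey" is hypothetical */

/* Pairwise correlation between the two predictor counts */
proc corr data=survey pearson spearman;
  var rfs pfs;
run;

/* Variance inflation factors and tolerances for the predictors */
proc reg data=survey;
  model sub1 = rfs pfs / vif tol;
run;
quit;
```

With only two predictors, the variance inflation factor is a direct function of their correlation, so the PROC CORR output alone already gives most of the picture.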
I am not seeing any clusters or outliers (this is what I was getting at with the distribution earlier). Taking a closer look with PROC SGPLOT, there are no outliers in the overall data; however, when broken down into sub1=0 and sub1=1, there are a few outliers in sub1 for risk factors (rfs>15). There are no other outliers in protective factors for either group.
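The thread does not say which plot type was used, so the following is only one plausible way to run that check: box plots of each predictor split by substance use, on a hypothetical data set named `survey`.

```sas
/* Assumption: data set name "survey" is hypothetical */

/* Risk factor counts, one box per value of sub1 */
proc sgplot data=survey;
  vbox rfs / category=sub1;
run;

/* Protective factor counts, one box per value of sub1 */
proc sgplot data=survey;
  vbox pfs / category=sub1;
run;
```

Points drawn beyond the whiskers are the observations a box plot flags as outliers within each group.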