Another possible reason is BAD data.
Check the standard errors of the two positive coefficient estimates and see whether they are very large.
The standard errors on the coefficients in the model are 0.00563 (rfs) and 0.00743 (pfs).
Some additional exploration that may help us figure this out:
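I decided to run four separate models:
- Substance use predicted by risk factors only
- Substance use predicted by protective factors only
- No substance use predicted by risk factors only
- No substance use predicted by protective factors only
The results are:
- Substance use 'yes'=-2.1091+0.2623(rfs)
- Substance use 'yes'=0.5846-0.0937(pfs)
- Substance use 'no'=2.1091-0.2623(rfs)
- Substance use 'no'=-0.5846+0.0937(pfs)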
The second two models were probably unnecessary, since they are just the first two with the signs reversed. All p-values are < 0.0001. The standard error is 0.00542 for pfs and 0.00478 for rfs in either version.
This shows that, when tested independently, the predictors behave as expected: an increase in the number of risk factors increases the probability of using substances, while an increase in protective factors reduces it. The problem arises when both are put into one model together.
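For anyone following along, the single-predictor fits and the combined fit could be set up roughly like this. It is only a sketch: the data set name work.survey is hypothetical, and it assumes sub1 is coded 1 for substance use and 0 otherwise.

```sas
/* Hypothetical data set: work.survey, one row per respondent.      */
/* Assumes sub1 = 1 for substance use and sub1 = 0 otherwise.       */

/* Risk factors only */
proc logistic data=work.survey;
   model sub1(event='1') = rfs;
run;

/* Protective factors only */
proc logistic data=work.survey;
   model sub1(event='1') = pfs;
run;

/* Both predictors together: the fit where the pfs sign changes */
proc logistic data=work.survey;
   model sub1(event='1') = rfs pfs;
run;
```

Switching to event='0' models the probability of no substance use, which only reverses the sign of every estimate and leaves the standard errors and p-values unchanged.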
Please show a PROC FREQ of how the variables interact.
As you showed before, PROC FREQ returns a p-value of <0.0001, so it seems that they are NOT independent.
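A minimal sketch of that check, assuming a hypothetical data set named work.survey. The PROC CORR step is an addition, not something run in this thread; it gives a single number for how strongly rfs and pfs move together.

```sas
/* Cross-tabulate the two counts and test for independence.        */
/* NOROW/NOCOL/NOPERCENT keep the 13x22 table compact.             */
proc freq data=work.survey;
   tables rfs*pfs / chisq norow nocol nopercent;
run;

/* Added here as a complement: direction and strength of the       */
/* association between the two predictors.                         */
proc corr data=work.survey pearson spearman;
   var rfs pfs;
run;
```

With a table this large, many cells are likely to be sparse, so PROC FREQ may warn that the chi-square test is not reliable. The correlation is easier to read in that case.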
Have you tried adding an interaction term?
Sorry, I saw the request but was having trouble getting it all to show up (13x22 table problems). Hopefully this helps.
I am not familiar with adding interaction terms. Is this just running the model as sub1=rfs pfs rfs*pfs?
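In PROC LOGISTIC that model specification would look roughly like the sketch below. The data set name work.survey is hypothetical, and sub1 is assumed to be coded 0/1.

```sas
/* Main effects plus their product. With no CLASS statement, rfs   */
/* and pfs are treated as continuous, so the interaction is one    */
/* coefficient.                                                    */
proc logistic data=work.survey;
   model sub1(event='1') = rfs pfs rfs*pfs;
run;

/* Equivalent shorthand: the bar operator expands to               */
/* rfs pfs rfs*pfs.                                                */
proc logistic data=work.survey;
   model sub1(event='1') = rfs|pfs;
run;
```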
(edit: I cropped the images to allow them to be easier to read)
Alright, adding only the interaction term, we get:
sub1=-2.8998+0.2695(rfs)+0.0755(pfs)+0.00576(rfs*pfs)
I tried declaring rfs and pfs as CLASS variables, and it produced a coefficient for each level of risk and protective factors (and, with rfs*pfs, a coefficient for each level of that interaction). It was very messy.
With 18+ as a final category we get:
sub1=-3.1942+0.3049(rfs)+0.1179(pfs)
Or, with the interaction effect:
sub1=-2.9301+0.2738(rfs)+0.0787(pfs)+0.00529(rfs*pfs)
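One way the 18+ category could be built is sketched below. It assumes the top-coding was applied to rfs (the only one of the two counts that goes above 18); the names work.survey, work.survey18 and rfs18 are hypothetical.

```sas
/* Collapse rfs values of 18 and above into a single "18+" level.  */
data work.survey18;
   set work.survey;
   /* An IF/ELSE keeps a missing rfs missing; MIN(rfs, 18) would   */
   /* turn a missing value into 18.                                */
   if rfs > 18 then rfs18 = 18;
   else rfs18 = rfs;
run;

/* Main effects only */
proc logistic data=work.survey18;
   model sub1(event='1') = rfs18 pfs;
run;

/* Main effects plus interaction */
proc logistic data=work.survey18;
   model sub1(event='1') = rfs18|pfs;
run;
```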
One idea I am considering: could this be an effect of having the data by observation? Instead, should I set it up so that there is one row for each combination of substance use = a (0 or 1), risk factors = b (0-21), and protective factors = c (0-12), plus the frequency of responses with that combination? For example, if 12 people who said they use the substance also reported 3 risk factors and 5 protective factors, the row would read (sub) 1, (rfs) 3, (pfs) 5, (n) 12. Right now each of those 12 people is a separate observation in the data set.
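For reference, the aggregated layout could be built and fitted as sketched below, again with hypothetical data set names. The maximum likelihood estimates and standard errors from a FREQ-weighted fit are the same as those from the one-row-per-person data, so this restructuring on its own should not change the sign of the pfs coefficient.

```sas
/* One row per observed (sub1, rfs, pfs) combination. PROC FREQ    */
/* writes the number of respondents to the variable COUNT.         */
proc freq data=work.survey noprint;
   tables sub1*rfs*pfs / out=work.survey_counts(drop=percent);
run;

/* The FREQ statement makes each row count COUNT times */
proc logistic data=work.survey_counts;
   freq count;
   model sub1(event='1') = rfs pfs;
run;
```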
@halkyos wrote:
Some additional exploration that may help us figure this out:
I decided to run four separate models:
- Substance use predicted by risk factors only
- Substance use predicted by protective factors only
- No substance use predicted by risk factors only
- No substance use predicted by protective factors only
The results are:
- Substance use 'yes'=-2.1091+0.2623(rfs)
- Substance use 'yes'=0.5846-0.0937(pfs)
- Substance use 'no'=2.1091-0.2623(rfs)
- Substance use 'no'=-0.5846+0.0937(pfs)
The second two models were probably unnecessary, since they are just the first two with the signs reversed. All p-values are < 0.0001. The standard error is 0.00542 for pfs and 0.00478 for rfs in either version.
This shows that, when tested independently, the predictors behave as expected: an increase in the number of risk factors increases the probability of using substances, while an increase in protective factors reduces it. The problem arises when both are put into one model together.
When your x-variables are correlated with one another, this can sometimes cause the "wrong" sign to appear on one or more of your regression coefficients. What about outliers and clusters?
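A quick collinearity check is sketched here, using a hypothetical data set name. PROC REG is borrowed only for its diagnostics: the variance inflation factor depends on the predictors alone, so it does not matter that the response is binary.

```sas
/* VIF and tolerance for the two predictors. The regression        */
/* estimates themselves are not of interest here.                  */
proc reg data=work.survey;
   model sub1 = rfs pfs / vif tol;
run;
quit;
```

With only two predictors the VIF equals 1/(1 - r**2), where r is the correlation between rfs and pfs, so it should agree with whatever PROC CORR reports for that pair.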
I am not seeing any clusters or outliers (this is what I was getting at with the distribution earlier). Taking a closer look with PROC SGPLOT, there are no outliers in the overall data. However, when broken down into sub1=0 and sub1=1, there are a few outliers in sub1 for risk factors (rfs>15). There are no other outliers in protective factors for either group.
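The by-group box plots described above could be produced roughly as follows. This is a sketch with a hypothetical data set name, not necessarily the exact plot request that was used.

```sas
/* Box plot of risk factors, one box per substance-use group */
proc sgplot data=work.survey;
   vbox rfs / category=sub1;
run;

/* Same view for protective factors */
proc sgplot data=work.survey;
   vbox pfs / category=sub1;
run;
```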