Ksharp
Super User

Another possible reason is BAD data.

Check the standard errors of these two positive coefficient estimates and see whether they are very large.

halkyos
Obsidian | Level 7

The standard errors on the coefficients in the model are 0.00563 (rfs) and 0.00743 (pfs).

halkyos
Obsidian | Level 7

Some additional exploration that may help us figure this out:

 

I decided to run four separate models:

  1. Substance use predicted by risk factors only
  2. Substance use predicted by protective factors only
  3. No substance use predicted by risk factors only
  4. No substance use predicted by protective factors only

The results are:

  1. Substance use 'yes'=-2.1091+0.2623(rfs)
  2. Substance use 'yes'=0.5846-0.0937(pfs)
  3. Substance use 'no'=2.1091-0.2623(rfs)
  4. Substance use 'no'=-0.5846+0.0937(pfs)

 

So the second half of those were probably unnecessary since they are just the inverse of their counterparts in the first two of the models. All p-values < 0.0001. Standard error on either pfs is 0.00542 and on either rfs is 0.00478.
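A quick numeric check of why models 3 and 4 add nothing new: the log-odds of 'no' is just the negative of the log-odds of 'yes'. This is a minimal Python sketch (not SAS), assuming the equations above are on the log-odds (logit) scale as logistic regression output normally is. The helper names are made up for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Probability that corresponds to log-odds x."""
    return 1.0 / (1.0 + math.exp(-x))

def logodds_yes(rfs):
    """Model 1 from the post: log-odds of substance use 'yes'."""
    return -2.1091 + 0.2623 * rfs

def logodds_no(rfs):
    """Model 3 from the post: log-odds of substance use 'no'."""
    return 2.1091 - 0.2623 * rfs

for rfs in (0, 5, 10):
    p_yes = inv_logit(logodds_yes(rfs))
    # logit(1 - p) equals -logit(p), so model 3 is model 1 with every sign flipped.
    assert abs(logit(1.0 - p_yes) - logodds_no(rfs)) < 1e-9
    print(rfs, round(p_yes, 3))
```

The same identity holds for the pfs models (2 and 4).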

 

This is to show that, when tested independently, these are behaving as expected: increases in number of risk factors increases the probability of using substances, while increases in protective factors reduces this probability. The problem is arising when they are thrown into a model together.

Reeza
Super User

This is to show that, when tested independently, these are behaving as expected: increases in number of risk factors increases the probability of using substances, while increases in protective factors reduces this probability. The problem is arising when they are thrown into a model together.

 

Please show a PROC FREQ of how the variables interact. 

As you showed before, the PROC FREQ returns a p-value of <0.0001, so it seems that they are NOT independent.

Have you tried adding an interaction term?



halkyos
Obsidian | Level 7

Sorry I saw the request and was having trouble getting it all to show up (13x22 table problems). Hopefully this helps.

 

I am not familiar with adding interaction terms. Is this just running the model as sub1=rfs pfs rfs*pfs?

 

[Attached images: rfs x pfs_Page_1.png, rfs x pfs_Page_2.png]

 

(edit: I cropped the images to allow them to be easier to read)

Reeza
Super User
Yes, that is how you add an interaction term.
I would consider collapsing 18-21 into a single 18+ category. I also wonder whether you should be treating them as categorical variables, since a count of factors is not really a continuous measure.
Try adding the variables to the CLASS statement.
halkyos
Obsidian | Level 7

Alright, adding only the interaction term, we get:

 

sub1=-2.8998+0.2695(rfs)+0.0755(pfs)+0.00576(rfs*pfs)
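One way to read this equation: with the interaction in the model, the effect of pfs is no longer a single number but depends on rfs, namely 0.0755 + 0.00576*rfs per extra protective factor. The Python sketch below (not SAS) just does that arithmetic with the coefficients copied from the line above, assuming they are on the log-odds scale. The function names are hypothetical, and the 0-21 range for rfs is taken from the description of the data in this thread.

```python
import math

# Coefficients copied from the fitted equation above (assumed log-odds scale).
B0, B_RFS, B_PFS, B_INT = -2.8998, 0.2695, 0.0755, 0.00576

def logodds(rfs, pfs):
    """Linear predictor of the interaction model."""
    return B0 + B_RFS * rfs + B_PFS * pfs + B_INT * rfs * pfs

def pfs_slope(rfs):
    """Change in log-odds per extra protective factor at a given rfs."""
    return B_PFS + B_INT * rfs

def prob(rfs, pfs):
    """Predicted probability of sub1=1."""
    return 1.0 / (1.0 + math.exp(-logodds(rfs, pfs)))

for rfs in (0, 5, 10, 21):
    slope = pfs_slope(rfs)
    # exp(slope) is the odds ratio for one more protective factor at this rfs.
    print(rfs, round(slope, 4), round(math.exp(slope), 3))
```

Since both terms are positive, the pfs slope stays positive across the whole rfs range, so the interaction term on its own does not change the sign of the pfs effect.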

 

I tried putting rfs and pfs as classes and it brought up coefficients for each level of risk and protective factors (and with the rfs*pfs it then did a coefficient for each level of that). It was very messy.

 

With 18+ as a final category we get:

 

sub1=-3.1942+0.3049(rfs)+0.1179(pfs)

 

OR (with the interaction effect)

 

sub1=-2.9301+0.2738(rfs)+0.0787(pfs)+0.00529(rfs*pfs)

 

One idea I am considering: could this be an effect of having the data by observation? Should I instead set it up so there is one row per combination, where substance use=a (0 or 1), risk factors=b (0-21), and protective factors=c (0-12), plus the frequency of responses with that combination? For example, if 12 people said they use the substance and also reported 3 risk factors and 5 protective factors, the row would read (sub) 1, (rfs) 3, (pfs) 5, (n) 12. Right now each of those 12 people is a separate observation in the data set.
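On the one-row-per-person versus one-row-per-combination question, a small Python sketch (not SAS) suggests the layout should not matter: the logistic log-likelihood computed from frequency-weighted rows equals the one computed from individual rows, for any coefficients, so the fitted estimates should be the same. The data and coefficient values below are made up purely for illustration. In PROC LOGISTIC the count column would go on a FREQ statement.

```python
import math
from collections import Counter

def log_p(sub, rfs, pfs, b0, b_rfs, b_pfs):
    """Log-probability of one observed outcome under a logistic model."""
    eta = b0 + b_rfs * rfs + b_pfs * pfs
    p = 1.0 / (1.0 + math.exp(-eta))
    return math.log(p if sub == 1 else 1.0 - p)

def loglik_rows(rows, b0, b_rfs, b_pfs):
    """Log-likelihood with one row per person."""
    return sum(log_p(s, r, p, b0, b_rfs, b_pfs) for s, r, p in rows)

def loglik_counts(counts, b0, b_rfs, b_pfs):
    """Log-likelihood with one row per (sub, rfs, pfs) combination and a count n."""
    return sum(n * log_p(s, r, p, b0, b_rfs, b_pfs)
               for (s, r, p), n in counts.items())

# Hypothetical data: 12 users with 3 risk and 5 protective factors, and so on.
rows = [(1, 3, 5)] * 12 + [(0, 3, 5)] * 30 + [(1, 8, 2)] * 7 + [(0, 1, 9)] * 20
counts = Counter(rows)  # aggregated form: {(sub, rfs, pfs): n}

# Arbitrary trial coefficients; the two layouts agree for any values.
for coefs in [(-2.0, 0.3, 0.1), (0.5, -0.1, -0.2)]:
    assert abs(loglik_rows(rows, *coefs) - loglik_counts(counts, *coefs)) < 1e-9
print(dict(counts))
```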

Reeza
Super User
Perhaps it's also not true... and those who have X protective factors are less likely to have risk factors, so really only one of those matters. You're also lumping all factors together, and the assumption that each one carries the same weight isn't necessarily true.
PaigeMiller
Diamond | Level 26

@halkyos wrote:

Some additional exploration that may help us figure this out:

 

I decided to run four separate models:

  1. Substance use predicted by risk factors only
  2. Substance use predicted by protective factors only
  3. No substance use predicted by risk factors only
  4. No substance use predicted by protective factors only

The results are:

  1. Substance use 'yes'=-2.1091+0.2623(rfs)
  2. Substance use 'yes'=0.5846-0.0937(pfs)
  3. Substance use 'no'=2.1091-0.2623(rfs)
  4. Substance use 'no'=-0.5846+0.0937(pfs)

 

So the second half of those were probably unnecessary since they are just the inverse of their counterparts in the first two of the models. All p-values < 0.0001. Standard error on either pfs is 0.00542 and on either rfs is 0.00478.

 

This is to show that, when tested independently, these are behaving as expected: increases in number of risk factors increases the probability of using substances, while increases in protective factors reduces this probability. The problem is arising when they are thrown into a model together.


When your x-variables are correlated with one another, this can sometimes cause the "wrong" sign to appear on one or more of your regression coefficients. What about outliers and clusters?
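To make the sign-flip mechanism concrete, here is a toy Python sketch (not SAS). It uses ordinary least squares rather than logistic regression, because the arithmetic fits in a few lines. The data are made up: x1 and x2 are strongly negatively correlated, and y is built with a positive coefficient on both. Fit alone, x2 gets a negative slope; fit together with x1, its slope is positive. This only shows that the sign can differ between the two fits; it says nothing about which sign is "right" for the real data.

```python
def centered(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up predictors: x2 tends to fall as x1 rises (negative correlation).
x1 = [0, 1, 2, 3, 4, 5, 6, 7]
x2 = [7, 7, 5, 4, 4, 2, 1, 0]
# Made-up outcome with a positive coefficient on BOTH predictors.
y = [2.0 * a + 0.5 * b for a, b in zip(x1, x2)]

c1, c2, cy = centered(x1), centered(x2), centered(y)
s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
s1y, s2y = dot(c1, cy), dot(c2, cy)

# Slope of y on x2 alone (simple regression).
marginal_b2 = s2y / s22

# Slopes of y on x1 and x2 together (2x2 normal equations, Cramer's rule).
det = s11 * s22 - s12 * s12
joint_b1 = (s22 * s1y - s12 * s2y) / det
joint_b2 = (s11 * s2y - s12 * s1y) / det

print("x2 alone:", round(marginal_b2, 4))
print("x1 and x2 together:", round(joint_b1, 4), round(joint_b2, 4))
```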

--
Paige Miller
halkyos
Obsidian | Level 7

I am not seeing any clusters or outliers (this is what I was getting at with the distribution earlier).  Taking a closer look, using PROC SGPLOT there are no outliers in the overall data, however when broken down into sub1=0 and sub1=1, there are a few outliers in sub1 for risk factors (rfs>15). There are no other outliers in protective factors for either group.
