Solved: Results from stratification not matching with interaction testing when using proc surveyreg

wetman · Posted 12-27-2022 08:57 AM

Hi all,

I'm doing linear regression using proc surveyreg. I noticed that my interaction testing results don't agree with what is shown after stratification, and I would appreciate it if you could help point out what is wrong.

After simplifying, I have XV as X variable (1,2,3,...,10), YV as Y variable (numeric >= 0), AGG as age group (categorical), Sex (categorical), Ethics (categorical). INCLUDE (0 = people to be excluded in this analysis, 1 = people to be included in this analysis).

Below is the code I used to test for interaction between XV and Sex/Age Group/Ethics:

Proc Format;
	Value FMTX			1="A" 2="B" 3="C" 4="D" 5="E" 6="F" 7="G" 8="H" 9="I" 10="J";
	value Sex			1 = "Men"	2 = "Women";
	value Agegroup		1 = "<3 yrs"	2 = "3-5 yrs"	3 = "6-12 yrs"	4 = "13-19 yrs"		5 = "20-39 yrs"	6 = "40-64 yrs"	7 = "65-79 yrs"	8 = ">=80 yrs";
	value Ethics		1 = "Hispanic"	3 = "Non-Hispanic White"	4 = "Non-Hispanic Black"	5 = "Other";
Run;
Proc Surveyreg Data = DS nomcar;
	Title "XV*Sex - Interaction Testing";
	Strata STRA;
	CLUSTER PSU;
	Class XV(ref='A') Sex(ref='Men') AGG(ref='20-39 yrs'); 
	Weight FinalWeight;
	Domain INCLUDE;
	Model YV = XV Sex AGG XV*Sex / solution clparm vadjust=none;
	Format XV FMTX.; Format Sex sex.; Format AGG agegroup.;
Run; Quit;
Proc Surveyreg data=DS nomcar;
	Title "XV*Age Group - Interaction Testing";
	Strata STRA;
	CLUSTER PSU;
	Class XV(ref='A') Sex(ref='Men') AGG(ref='20-39 yrs'); 
	Weight FinalWeight;
	Domain INCLUDE;
	Model YV = XV Sex AGG XV*AGG / solution clparm vadjust=none;
	Format XV FMTX.; Format Sex sex.; Format AGG agegroup.;
Run; Quit;
Proc Surveyreg data=DS nomcar;
	Title "XV*Ethics - Interaction Testing";
	Strata STRA;
	CLUSTER PSU;
	Class XV(ref='A') Sex(ref='Men') AGG(ref='20-39 yrs') Ethics(ref='Hispanic'); 
	Weight FinalWeight;
	Domain INCLUDE;
	Model YV = XV Sex AGG Ethics XV*Ethics / solution clparm vadjust=none;
	Format XV FMTX.; Format Sex sex.; Format AGG agegroup.; Format Ethics Ethics.;
Run; Quit;

The interaction P values I got were:

Sex: P=0.048 < 0.05

Age Group: P=0.286

Ethics: P=0.008 < 0.05

And the below screenshot shows where I got the P value.

If my understanding is correct, the significant interaction observed for sex and ethics means that if I stratify the population by sex/ethics in a linear regression model (y=b0+b1x), the slope b1 should be quite different between strata. But after stratification, the results I got don't seem to show that.

I then ran the linear regression stratified by sex, age group, and ethics in turn. The code I used was:

Proc Surveyreg Data = DS nomcar;
	Title "Sex Stratified";
	Strata STRA;
	Cluster PSU;
	Class Sex(ref='Men') AGG(ref='20-39 yrs');
	Weight FinalWeight;
	Domain INCLUDE*Sex;
	Model YV = XV AGG / noint solution clparm vadjust=none;
	Format XV cycle.; Format Sex sex.; Format AGG agegroup.;
Run; Quit;
Proc Surveyreg data = DS nomcar;
	Title "Age Group Stratified";
	Strata STRA;
	Cluster PSU;
	Class Sex(ref='Men') AGG(ref='20-39 yrs');
	Weight FinalWeight;
	Domain INCLUDE*AGG;
	Model YV = XV Sex / noint solution clparm vadjust=none;
	Format XV cycle.; Format Sex sex.; Format AGG agegroup.;
Run; Quit;
Proc Surveyreg data = DS nomcar;
	Title "Ethics Stratified";
	Strata STRA;
	Cluster PSU;
	Class Sex(ref='Men') AGG(ref='20-39 yrs') Ethics(ref='Hispanic');
	Weight FinalWeight;
	Domain INCLUDE*Ethics;
	Model YV = XV Sex AGG / noint solution clparm vadjust=none;
	Format XV cycle.; Format Sex sex.; Format AGG agegroup.; Format Ethics Ethics.;
Run; Quit;

The estimated slopes and their 95% confidence intervals are shown in the figure below:

Here is an example of where I got the estimate and 95% confidence interval:

The confidence intervals in the figure for the sex strata don't look far apart, but the interaction between the X variable and sex is significant. The age group strata, especially 40-64 yrs, are quite far apart in the figure, but the interaction between the X variable and age group was not significant.

I kinda feel there's something wrong with what I did but I'm not sure where. I've also attached the log and output in case they're needed. I've been stuck on this for more than a week. I'd appreciate it if you could assist me with this. Thanks!
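
One thing that may make the two sets of results hard to compare, noted here as a reading of the posted code rather than something confirmed in the thread: the interaction models list XV in the CLASS statement, so XV*Sex is tested across the ten XV levels (up to 9 degrees of freedom), while the stratified models leave XV out of CLASS and estimate a single slope per domain. A minimal sketch of an interaction model that treats XV as continuous, so that XV*Sex becomes a single slope-difference parameter, might look like the following. The title text is made up for illustration; all data set and variable names are taken from the post.

```sas
/* Sketch only: same design statements as the posted models, but XV is */
/* left out of CLASS so that it enters as a continuous predictor.      */
proc surveyreg data=DS nomcar;
   title "XV*Sex - Interaction with XV continuous (sketch)";
   strata STRA;
   cluster PSU;
   class Sex(ref='Men') AGG(ref='20-39 yrs');   /* XV deliberately omitted */
   weight FinalWeight;
   domain INCLUDE;
   /* With XV numeric, XV*Sex is one slope-difference term (Women vs. Men). */
   model YV = XV Sex AGG XV*Sex / solution clparm vadjust=none;
   format Sex sex. AGG agegroup.;
run;
```

In the INCLUDE=1 output, the XV*Sex row for Women would then estimate how much the slope for women differs from the slope for men. Exact agreement with the stratified runs is still not expected, because those runs use NOINT and estimate the AGG effects separately within each domain.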

ballardw · Posted 12-30-2022 04:03 AM

Can you explain exactly which bits you expect to "agree" in quite different models? And for which specific models?

Your question "I noticed that my interaction testing results doesn't agree with what is shown after stratification, and would appreciate if you could help pointing out what was wrong." seems to assume that you think something is supposed to stay the same when you have different variables in your models.

You change domains, meaning different records get included in the models, as shown in the "Number of observations in domain".

Even with similar models, changing the number of records from about 46000 to about 12000 would typically show differences, and if one model is close to "significant", the other may very well not be, just from noise differences in the subsets used.

wetman · Posted 12-30-2022 10:44 AM

Thanks ballardw for your reply! For example, when I get the result that sex significantly interacts with the X variable (P=0.048<0.05), it means that being a man or a woman significantly changes how the Y variable changes when the X variable changes. So am I correct to expect that when conducting linear regression on the men and women subpopulations separately (stratification), the slope (b1 in Y = b0 + b1*X) for men and women would be quite different? But in the stratified results I got, they were quite close. And when looking at age group, the interaction was not significant, but the stratified slope for 40-64 yrs was so different from the other age groups. That is what doesn't meet my expectation.

To explain what I did with domain and why the record numbers differ: every model in this example uses domain INCLUDE, and the results I picked were all from INCLUDE=1. This is to exclude the population that I'm not interested in for this study. When doing stratification, an additional domain variable was added, such as sex, age group or ethics. This is to stratify the population of interest, which is why the number of observations dropped.

Please correct me if I've got any misunderstanding. Thanks!
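
The expectation described here can be written out for the simplest case. The block below is a sketch that assumes X is a single numeric predictor and W is a 0/1 indicator for women; the symbols are introduced for illustration and do not correspond one-to-one to the posted models.

```latex
\begin{aligned}
E[Y \mid X, W] &= \beta_0 + \beta_1 X + \beta_2 W + \beta_3 (X \cdot W),
  \qquad W = 0 \text{ (men)},\ W = 1 \text{ (women)} \\
\text{slope for men} &= \beta_1 \\
\text{slope for women} &= \beta_1 + \beta_3 \\
H_0 : \beta_3 &= 0, \qquad t = \hat\beta_3 \,/\, \operatorname{SE}(\hat\beta_3) \\
\text{if the two stratum slopes were independent:}\quad
\operatorname{SE}(\hat b_{w} - \hat b_{m}) &= \sqrt{\operatorname{SE}(\hat b_{m})^2 + \operatorname{SE}(\hat b_{w})^2}
  \;\le\; \operatorname{SE}(\hat b_{m}) + \operatorname{SE}(\hat b_{w})
\end{aligned}
```

The interaction test concerns the difference in slopes relative to the standard error of that difference, not whether the two separate 95% intervals overlap. Because the standard error of the difference is no larger than the sum of the two standard errors in the independent case, two intervals can overlap somewhat while the difference is still significant. In a clustered survey design the domain estimates need not be independent, so the last line is only a rough guide.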

ballardw · Posted 12-30-2022 11:55 AM

If one group in YOUR data (we have no idea what that data set actually contains) has drastically different model output, then it seems reasonable to delve into your data to see whether something is going on.

Did the proportion of males to females change for that age group? If the gender proportion changes for some reason, that might affect a model.

Might there be some other confounding reasons affecting that age/gender combination more than others? Perhaps there is something actually related to age going on. Consider something like "retirement". It might be that one gender in that age group starts retiring before 64 earlier than the other. Or income may change more for one gender in that age group.

Depending on the data I might be tempted to look at different boundaries of age groups.

You may also want to investigate your sampling frame a bit. I know some geographic samples that might have very different results for age because the proportion of ages is quite different in a location. Look up "Sun City, Arizona" for a moderately extreme case.
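
The check on gender proportions within age groups suggested above could be sketched with PROC SURVEYFREQ, reusing the design variables from the original post. This is only an illustration under the assumption that the same STRA, PSU and FinalWeight variables apply; the title text is made up.

```sas
/* Sketch only: weighted Sex distribution within each age group,  */
/* shown separately for each level of INCLUDE.                    */
proc surveyfreq data=DS nomcar;
   title "Sex by Age Group within INCLUDE (sketch)";
   strata STRA;
   cluster PSU;
   weight FinalWeight;
   /* The first variable defines the layers; ROW adds row percentages. */
   tables INCLUDE*AGG*Sex / row;
   format Sex sex. AGG agegroup.;
run;
```

Comparing the row percentages for the 40-64 yrs group with those of the other age groups in the INCLUDE=1 layer would show whether the male/female mix is unusual there.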
