dkcundiffMD
Quartz | Level 8

In a worldwide Global Burden of Disease database of 195 countries, my co-author and I are modeling body mass index (BMI, the dependent variable) with 20 dietary variables, total kilocalories available, physical activity, sex, discontinuation of breast feeding, and severe underweight in infancy. We are revising our published preprint (https://www.medrxiv.org/content/10.1101/2020.07.27.20162487v1).

 

With SAS Studio 9.4, we formatted the 20 dietary variables into a single composite variable (BMI17f1), weighting each dietary variable by its worldwide mean kilocalories/day and by the R2 of its univariate correlation with BMI (the dependent variable). See the attached code.

 

In a VIF analysis, the 20 dietary variables and total kilocalories available all had VIF < 10 (the usual cutoff for problematic variance inflation). See the attached VIF testing code and VIF testing results PDFs. When the four other variables are added to the VIF analysis, milk (11.38) and child severe underweight (11.32) exceed 10. When the 20 dietary variables are formatted into the composite variable BMI17f1 and VIF-tested with the other variables, the VIF for every variable is < 10 (see the code and results PDFs).

Is there any reason to delete any of the variables from the multiple regression analysis?

ACCEPTED SOLUTION
PaigeMiller
Diamond | Level 26

I used NFAC=5, which throws out 19/25 variables

PLS does not throw out variables. Five factors indicates that five new factors/dimensions (these are different words for the same thing) are computed and used in the modeling. All variables contribute to the fitted model, some more than others, according to what the data is saying, but nothing is thrown out. The final regression equation from PROC PLS will use all 25 variables.
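To make this concrete, here is a hedged sketch (the dataset name `have` and the predictor list `x1-x25` are placeholders, not from this thread): with NFAC=5, PROC PLS extracts five factors, but the SOLUTION option still prints a regression coefficient for every one of the 25 predictors.

```
/* Sketch only: 5 PLS factors extracted, yet every predictor keeps a coefficient. */
/* "have" and x1-x25 are placeholder names.                                       */
proc pls data=have method=pls nfac=5;
   model BMI17msW = x1-x25 / solution;  /* SOLUTION prints coefficients for all 25 variables */
run;
```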

 

The Tobias paper discussion began, "As discussed in the introductory section, soft science applications involve so many variables that it is not practical to seek a 'hard' model explicitly relating them all." Well, we claim that our 24-variable (soon to be 25-variable) hard science formula modeling BMI gives excellent predictability. The resultant BMI formula performs equally well on the other eight Bradford Hill causality criteria, the "gold standard" criteria for providing proof in epidemiology.

 

Dr. Miller, can you please agree that the hard science Proc reg is more suitable than the soft science Proc PLS for this worldwide BMI modeling application and tell my statistician co-author that we shouldn’t have to throw out any variables?

 

PROC REG can be considered "more suitable" if you are willing to accept the effects of collinearity on your regression coefficients. But it is an empirical approach: it uses the data you provide to determine the best-fitting regression equation, without regard for the known and previously determined (by others) BMI model. So, either because your data is different or because the collinearity causes sufficient problems, you can get (and apparently do get, based on your earlier statements) coefficients with the wrong sign, and coefficients so variable due to collinearity that they may be far from the theoretical value. Maybe you want a mixed "empirical-hard model" model, but I have no idea how to get that, and I'm not even sure such a thing exists. (So I don't agree PROC REG is appropriate; it has the problems mentioned in this paragraph.)

 

PLS is also empirical: it takes the data you provide and determines a predictive model, using a different algorithm than linear regression, and so it produces a different predictive model, without regard for the known and previously determined (by others) BMI model. There's no getting around the fact that it is empirical. The benefit, as stated many times now, is that it is robust to multicollinearity: the regression coefficients will have the right sign (right sign based on the data, not on the BMI model, which PLS doesn't use) and low variability, though they are biased. Maybe you want a mixed "empirical-hard model" model, but I have no idea how to get that, and I'm not even sure such a thing exists.

 

Which brings us back to the very first question that I should have asked: what is the goal of this modeling? Is it to fit the data? Is it to confirm the BMI model holds on this data? Is it something else? When someone asks about VIFs and regression modeling, I assume they are talking about empirical modeling and the goal of the modeling is to find a predictive model that fits the data, but now it sounds like that is not the goal.

 

--
Paige Miller


18 REPLIES
PaigeMiller
Diamond | Level 26

The rule I heard was that VIF > 3 signals a problem. However, I'm sure it's just a rule of thumb, nothing set in stone. Maybe 10 is the right number.

 

A high VIF on a variable doesn't necessarily mean that variable must be removed from the model; sometimes you need to delete some other variable from the model instead.
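As a sketch of that diagnostic loop (the dataset and variable names `have`, `y`, and `x1`-`x3` are placeholders):

```
/* Request variance inflation factors for every predictor */
proc reg data=have;
   model y = x1 x2 x3 / vif;
run; quit;

/* If x1 shows a high VIF because it overlaps with x2, dropping x2
   (rather than x1) may bring x1's VIF down -- re-check after each removal */
proc reg data=have;
   model y = x1 x3 / vif;
run; quit;
```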

 

In cases of many x variables with relatively high correlations, I suggest you use Partial Least Squares regression (PROC PLS). In this paper, Randall Tobias of SAS Institute, says:

 

Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear.

There are plenty of other published papers making similar claims, and a study has shown that PLS produces estimates and predictions with lower mean square error than other similar methods. So PLS is what I recommend in your situation.
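In PROC PLS terms the suggestion is roughly this (a sketch; `have`, `y`, and `x1-x20` are placeholders): let cross-validation pick the number of factors.

```
/* Partial least squares with leave-one-out cross-validation;
   CVTEST adds van der Voet's test for choosing the number of factors */
proc pls data=have method=pls cv=one cvtest;
   model y = x1-x20;
run;
```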

--
Paige Miller
dkcundiffMD
Quartz | Level 8

Thanks for the suggestion about proc PLS. 

 

At issue with my statistician colleague is whether formatting the 20 dietary variables into a single compound variable in our model means that the compound dietary variable should be the one considered for VIF or PLS, as I contend. The 20 individual dietary variables will not appear in the multiple regression formula, only the combination variable. Attached are the SAS code and results of the multiple regression analysis. Should the variable at issue be the combination one, or all 20 dietary variables?

 

We will also do the PLS analysis. 

Thanks. 

PaigeMiller
Diamond | Level 26

I have no idea about the formatting of 20 variables into a single compound variable ... I never download attachments, so I have no idea what you did. In addition, I have never even heard of such a thing.

 

Include the code in your reply, by clicking on the "running man" icon and pasting the text of your code into the window that appears.

--
Paige Miller
dkcundiffMD
Quartz | Level 8
 

Thanks for looking at this, Dr. Miller,

This is a worldwide population weighted database with cohorts from 195 countries.

 


libname IHME2017 "/folders/myfolders/IHME2017";
*Dietary variables are highly collinear. There is no way it would make sense to put
20 dietary variables individually into a multiple regression. I did it and 8/20 of
the variables switched signs (+ or -) from the univariate correlation to the multiple
regression formula sign. Our composite dietary variable was constructed with each variable
multiplied times its kilocalories/day consumed and then times the R2 of the univariate
correlation with BMI. We arrived at this formatting empirically. No one else has ever done this
kind of statistical analysis of worldwide population weighted data from the Institute of Health
Metrics and Evaluation, the World Health Organization, or anywhere.;

data source;
set IHME2017.source;
label BMI17f1="20 foods combined risk factor";
BMI17f1 =
+ pmeat17KCsW * 5.49 * 0.3649
+ rmeat17KCsW * 50.70 * 0.4380
+ fish17KCsW * 10.01 * 0.0105
+ milk17KCsW * 25.37 * 0.4682
+ poultry16KCsW * 45.06 * 0.6611
+ eggs16KCsW * 19.47 * 0.4731
+ Alcohol17KCsW * 81.71 * 0.0277
+ Sugarb17KCsW * 297.65 * 0.0142
+ corn16KCsW * 34.67 * 0.0053
+ potatoes16KCsW * 84.16 * 0.0443
+ SFA16KCsW * 191.27 * 0.5038 * 0.477
+ PUFA17KCsW * 82.24 * 0.5431 * 0.477
+ TFA17KCsW * 13.40 * 0.2379 * 0.477
- fruits17KCsW * 40.39 * 0.3873
- Vegetables17KCsW * 80.14 * 0.270
- nutsseeds17KCsW * 8.51 * 0.2367
- wgrains17KCsW * 55.65 * 0.0402
- legumes17KCsW * 51.66 * 0.1477
- rice16KCsW * 141.23 * 0.3144
- swtpot16KCsW * 22.67 * 0.0230
;
run; quit;
*The first BMI versus risk factors multiple regression includes the composite
dietary variable, physical activity, and sex;
Proc reg data=source;
model BMI17msW=BMI17f1 PAMets17msW sex_IDsW
/ selection=STEPWISE slentry=.25 slstay=.25;
run; quit;
* Results
BMI17f1       0.00545  0.00004933  3843.16145  12213.9  <.0001
PAMets17msW  -0.13763  0.00710      118.27826   375.90  <.0001
sex_IDsW      0.12293  0.00662      108.47074   344.73  <.0001
1  BMI17f1      foods combined risk factor  1  0.6446  0.6446  1023.48  14300.9  <.0001
2  PAMets17msW  Physical activity METs      2  0.0271  0.6717  346.730   650.39  <.0001
3  sex_IDsW     sex male 1 female 2         3  0.0138  0.6855   4.0000   344.73  <.0001
*;
*First step BMI formula with dietary variables, physical activity and sex, R2=0.6855;
data source;
set source;
label BMI17f2="combo diet, physical act, and sex";
BMI17f2 =
+ BMI17f1 * 0.00545
- PAMets17msW * 0.13763
+ sex_IDsW * 0.12293
;
run; quit;

*VIF testing of the three variables;
Proc reg data=source;
model BMI17msW=BMI17f1 PAMets17msW sex_IDsW / VIF;
run; quit;

*Results
Parameter Estimates

Variable     Label                          DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept    Intercept                       1      -1.6351E-13         0.00632       -0.00   1.0000          0
BMI17f1      20 foods combined risk factor   1       0.00545            0.00004933   110.52   <.0001    1.18131
PAMets17msW  Physical activity METs          1      -0.13763            0.00710      -19.39   <.0001    1.26268
sex_IDsW     Sex male 1 and female 2         1       0.12293            0.00662       18.57   <.0001    1.09860
;

 



*The second step of the entire multiple regression includes childhood severe underweight,
discontinuation of breast feeding before 6 months, and total kilocalories available;
Proc reg data=source;
model BMI17msW=Childunwt17msW discbreastF17msW kcal2016msW
/ selection=STEPWISE slentry=.25 slstay=.25;
run; quit;
* Results
Childunwt17MsW    -0.38646  0.00841   354.66255  2111.00  <.0001
discbreastF17msW   0.08612  0.00960    13.51799    80.46  <.0001
kcal2016msW        0.54075  0.00673  1085.87775  6463.29  <.0001
1  kcal2016msW       Total Kc/d avail                   1  0.7112  0.7112  5671.77  19415.9  <.0001
2  Childunwt17MsW    Child/infant 2SD underweight       2  0.1191  0.8303  82.4609  5535.51  <.0001
3  discbreastF17msW  Discontinued breast feeding <6 mo  3  0.0017  0.8321   4.0000    80.46  <.0001
;
*VIF testing the second step;
Proc reg data=source;
model BMI17msW=Childunwt17msW discbreastF17msW kcal2016msW / VIF;
run; quit;
*Results
Parameter Estimates

Variable          Label                              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept         Intercept                           1      -1.3083E-13         0.00462      -0.00    1.0000          0
Childunwt17MsW    Child/infant 2SD underweight        1      -0.38646            0.00841     -45.95    <.0001    3.32051
discbreastF17msW  Discontinued breast feeding <6 mo   1       0.08612            0.00960       8.97    <.0001    4.32648
kcal2016msW       Total Kc/d avail                    1       0.54075            0.00673      80.39    <.0001    2.12331
;

*Second step BMI formula R2=0.8321;
data source;
set source;
label BMI17f3="combo child underweight, discontinued breast feeding before 6 mo, and kcal";
BMI17f3 =
- Childunwt17MsW * 0.38646
+ discbreastF17msW * 0.08612
+ kcal2016msW * 0.54075
;
run; quit;
*The third step to create the BMI formula is to combine the first two steps;
Proc reg data=source;
model BMI17msW=BMI17f2 BMI17f3
/ selection=STEPWISE slentry=.25 slstay=.25;
run; quit;
* Results:
BMI17f2  0.35423  0.00804   261.63900   1941.01  <.0001
BMI17f3  0.74815  0.00730  1417.51533  10516.1  <.0001
1  BMI17f3  combo childUnWt, disc breast feeding, and kcal  1  0.8321  0.8321  1942.01  39060.5  <.0001
2  BMI17f2  combo diet, physical act, and sex               2  0.0332  0.8652   3.0000   1941.01  <.0001
;

*Third step VIF analysis results;
Variable   Label                                           DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept  Intercept                                        1      -1.4122E-13         0.00413      -0.00    1.0000          0
BMI17f2    combo diet, physical act, and sex                1       0.35423            0.00804      44.06    <.0001    2.59058
BMI17f3    combo childUnWt, disc breast feeding, and kcal   1       0.74815            0.00730     102.55    <.0001    2.59058


*From the second step BMI formula, we wanted to include only the variance not accounted
for by the first step BMI formula. We did this by subtracting the R2 of the first step
BMI formula (0.6855) from the combined first and second step BMI formula R2 (0.8652):
BMI17f2 R2 = 0.6855. BMI17f2 + BMI17f3 R2 = 0.8652. 0.8652 - 0.6855 = 0.1797;
Proc reg data=source;
model BMI17msW=BMI17f2 BMI17f3 / VIF;
run; quit;
*Final combined BMI formula R2=0.775438;
data source;
set source;
label BMI17f4="BMI formula precursor";
BMI17f4 =
+ BMI17f2 * 0.6855
+ BMI17f3 * 0.1797
;
run; quit;
*On an Excel spreadsheet, we divided up the percent weights of all the variables and
adjusted them to total the Total BMI formula R2 (= 0.775438);
data source;
set source;
label BMI17f5="Final BMI formula, risk factors expressed as percent weights";
BMI17f5 = (
+ pmeat17KCsW * 0.450
+ rmeat17KCsW * 4.991
+ fish17KCsW * 0.024
+ milk17KCsW * 2.670
+ poultry16KCsW * 6.695
+ eggs16KCsW * 2.069
+ Alcohol17KCsW * 0.509
+ Sugarb17KCsW * 0.948
+ corn16KCsW * 0.041
+ potatoes16KCsW * 0.838
+ SFA16KCsW * 10.330
+ PUFA17KCsW * 4.788
+ TFA17KCsW * 0.342
- fruits17KCsW * 3.516
- Vegetables17KCsW * 4.865
- nutsseeds17KCsW * 0.453
- wgrains17KCsW * 0.503
- legumes17KCsW * 1.715
- rice16KCsW * 9.978
- swtpot16KCsW * 0.117
- PAMETs17msW * 5.675
+ sex_idsW * 5.069
- Childunwt17msW * 4.178
+ discbreastF17msW * 0.931
+ kcal2016msW * 5.845
) * 0.05464712 + 21.78981
;
run; quit;

proc corr data=source fisher;
label BMI17f1="foods combined risk factor"
      BMI17f2="Mult reg: combo diet, physical act, and sex"
      BMI17f3="Mult reg: combo childUnWt, disc breast feeding, and kcal"
      BMI17f4="Precursor BMI formula"
      BMI17f5="Final BMI formula, risk factors expressed as percent weights";
var PAMets17msW sex_IDsW BMI17f1 BMI17f2 BMI17f3 BMI17f4 BMI17f5;
with BMI17m;
run; quit;

*Results
1 With Variables: BMI17m
7 Variables: PAMets17msW sex_IDsW BMI17f1 BMI17f2 BMI17f3 BMI17f4 BMI17f5
 
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
BMI17m 7886 21.78981 2.31161 171834 17.94889 29.38579 BMI kg/M2
PAMets17msW 7886 -1.149E-13 1.00000 -9.059E-10 -2.26696 2.16518 Physical activity METs
sex_IDsW 7886 0 1.00000 0 -0.99994 0.99994 Sex male 1 and female 2
BMI17f1 7886 0 139.17979 0 -277.97823 429.93603 foods combined risk factor
BMI17f2 7886 0 0.82767 0 -1.81719 2.64672 Mult reg: combo diet, physical act, and sex
BMI17f3 7886 0 0.91216 0 -1.87979 2.11848 Mult reg: combo child underweight, discontinued breast feeding before 6 months, and kilocalories available/day
BMI17f4 7886 0 0.70322 0 -1.42007 2.10069 Precursor BMI formula
BMI17f5 7886 21.78981 2.31161 171834 17.12183 28.69532 Final BMI formula, risk factors expressed as percent weights
 
Pearson Correlation Coefficients, N = 7886
Prob > |r| under H0: Rho=0 (all p < .0001)

                    PAMets17msW  sex_IDsW  BMI17f1  BMI17f2  BMI17f3  BMI17f4  BMI17f5
BMI17m (BMI kg/M2)  -0.44498     0.12202   0.80288  0.82793  0.91217  0.88060  0.88059
Pearson Correlation Statistics (Fisher's z Transformation)
Variable  With Variable  N  Sample Correlation  Fisher's z  Bias Adjustment  Correlation Estimate  95% Confidence Limits  p Value for H0: Rho=0
PAMets17msW BMI17m 7886 -0.44498 -0.47843 -0.0000282 -0.44496 -0.462488 -0.427082 <.0001
sex_IDsW BMI17m 7886 0.12202 0.12263 7.73718E-6 0.12201 0.100206 0.143692 <.0001
BMI17f1 BMI17m 7886 0.80288 1.10668 0.0000509 0.80287 0.794880 0.810574 <.0001
BMI17f2 BMI17m 7886 0.82793 1.18151 0.0000525 0.82791 0.820840 0.834730 <.0001
BMI17f3 BMI17m 7886 0.91217 1.54031 0.0000578 0.91216 0.908379 0.915796 <.0001
BMI17f4 BMI17m 7886 0.88060 1.37845 0.0000558 0.88059 0.875537 0.885453 <.0001
BMI17f5 BMI17m 7886 0.88059 1.37838 0.0000558 0.88057 0.875520 0.885437 <.0001

I'm sorry that I didn't present the problem/question to you this way in the first place. I hope you will agree that, since the 20 dietary variables are combined into one, the VIF test is more appropriate than PLS with many variables in a multiple regression. In this analysis, none of the 6 variables (not 25) had VIF > 5.

What do you think? 

Thank you. 

David Cundiff 

PaigeMiller
Diamond | Level 26

Why not combine all variables into one and then eliminate all multicollinearity everywhere? (That's a rhetorical question; I am not really suggesting you do that.) The problem when you do this is that all of your variables are combined and you can't determine individual effects: you can't tell whether pmeat17KCsW is a good predictor or not. The same applies to all the other variables combined this way.

 

*Dietary variables are highly collinear. There is no way it would make sense to put 20 dietary variables individually into a multiple regression. I did it and 8/20 of the variables switched signs (+ or -) from the univariate correlation to the multiple regression formula sign. Our composite dietary variable was constructed with each variable multiplied times its kilocalories/day consumed and then times the R2 of the univariate correlation with BMI. We arrived at this formatting empirically. No one else has ever done this kind of statistical analysis of worldwide population weighted data from the Institute of Health Metrics and Evaluation, the World Health Organization, or anywhere.

Why would it not make sense to put 20 dietary variables individually into a multiple regression? Don't you want to know which of these variables are good predictors? Would it not "make sense" because of multicollinearity? If that's what you are saying, then I disagree: you still want to know which of these variables are good predictors. I don't think you want to choose an algorithm and let it drive your choices and force you to throw away information (variables); you want an algorithm that allows you to obtain results that meet your needs.

 

With ordinary least squares regression, people spend a lot of time and get a lot of grey hairs trying to work around multicollinearity. Many proposed methods are time-consuming, most require throwing away information (variables), and even then some multicollinearity remains in the x-variables. All have certain drawbacks, especially stepwise regression, which I avoid like the plague.

 

So, I suggest you use a fitting algorithm that is extremely robust to multicollinearity. So robust, in fact, that most published papers don't even go through a step of eliminating variables, and they still get models that predict well and are useful. What is that method? It is PARTIAL LEAST SQUARES regression. Then you don't have to throw away variables, and you don't have to spend huge amounts of time figuring out what to do about multicollinearity.

 

Published papers sometimes have 1000 input variables, highly correlated with each other, and PLS fits the model well with no effort spent on eliminating the effects of multicollinearity. I already gave you a link to a paper, written at SAS Institute (have you heard of them?), that took 1000 highly correlated input variables and came up with a usable, well-fitting model. With this method, the effects of multicollinearity are low, the likelihood that terms in the model will flip signs is very low, and the predicted values have much lower mean squared error than via the other methods studied (including stepwise).

 

 

--
Paige Miller
dkcundiffMD
Quartz | Level 8

Thanks again for working on this with me. 

 

Don't you want to know which of these variables are good predictors? Would it not "make sense" because of multicollinearity? If that's what you are saying ... then I disagree ... you still want to know which of these variables are good predictors. I don't think you want to choose an algorithm and let this drive your choices and force you to throw away information (variables), you want an algorithm that allows you to obtain results that meet the needs.

 

I agree. We tested this algorithm with the nine Bradford Hill causality criteria and all nine criteria strongly supported the algorithm. https://www.medrxiv.org/content/10.1101/2020.07.27.20162487v1

Since that preprint was published in July, we satisfied the ninth Bradford Hill criterion, specificity. We are introducing a new methodology to nutritional epidemiology, a science that has been criticized as yielding predominantly implausible results (https://www.bmj.com/content/bmj/347/bmj.f6698.full.pdf). When we do a multiple regression of BMI (dependent variable) versus the 20 dietary variables, we get a BMI formula that accounts for 89% of the variance. However, it is nonsense because 8/20 variables flipped signs (+ or -) between univariate correlation and the BMI formula.

 

Our methodology uses worldwide Global Burden of Disease population weighted data with ecological cohorts representative of about 7.8 billion people in 2020. We consider that all of the 20 dietary variables are good predictors. However, these dietary variables need to be formatted optimally to satisfy the Bradford Hill criteria. Each of the dietary variables is adjusted according to its kilocalories/day consumed per capita and the R2 of its correlation with BMI. When this is done, the correlation of the composite dietary variable with BMI accounts for about 68% of the variance.

 

 

With ordinary least squares regression, people spend a lot of time and get a lot of grey hairs trying to work around multicollinearity. There are many methods proposed that are time consuming and most require throwing away information (variables) and the problem remains that some multicollinearity exists in the x-variables, and all have certain drawbacks, especially stepwise regression, which I avoid like the plague.

 

So, I suggest you use a fitting algorithm that is extremely robust to multicollinearity. So robust, in fact, that most published papers don't even go through a step of eliminating variables, and they still get models that predict well and are useful. What is that method? It is PARTIAL LEAST SQUARES regression. Then you don't have to throw away variables, and you don't have to spend huge amounts of time figuring out what to do about multicollinearity.

 

I see how what you are saying would apply well to a field other than nutritional epidemiology. For instance, processed meat (5 kilocalories/day per capita) and red meat (50 kcal/day per capita) are both strongly correlated with BMI (say r=0.70 and r=0.60, respectively). In the BMI formula comprising these 20 dietary risk factors, processed meat has a "+" sign and red meat has a "-" sign. So red meat would have to be removed from the algorithm.

 

How would partial least squares regression solve this problem?

 

Published papers sometimes have 1000 input variables, highly correlated with each other, and PLS fits the model well, and no effort is spent on eliminating the effects of multicollinearity. I already gave you a link to a paper that took 1000 input variables, highly correlated, and came up with a usable and well-fitting model, said paper written at SAS Institute (have you heard of them?). In this method, the effects of multicollinearity are low, the likelihood that terms in the model will flip signs is very low, and the predicted values have much lower mean squared error than via other methods

 

We used a composite dietary risk factor methodology that includes all dietary risk factors. Is there any way that switching from our methodology to the partial least squares methodology, with the 20 individual dietary risk factors or the 12 that didn't switch signs under multiple regression, would improve our results of satisfying all nine Bradford Hill causality criteria? For example: (1) strength, r=0.907 (95% CI: 0.903 to 0.911), p<0.0001; (2) experiment, 20/20 bootstrap BMI formulas (n=100 cohorts) have the 25 risk factors with the same signs as the worldwide BMI formula; (3) consistency, absolute difference between mean BMI and BMI formula output < 0.300 BMI units, with less than one-third of subgroups out of range of the 20 bootstrap validating randomly generated BMI formulas.

PaigeMiller
Diamond | Level 26

Use all 20 of your variables in the PLS model. Flipping signs generally isn't a problem in PLS, unless the BMI formula itself has the wrong sign. You see, PLS doesn't know anything about the BMI formula; it only knows the data. If x is positively correlated with y, PLS will almost always produce a positive sign on the regression coefficient (and if x is negatively correlated with y, it will almost always produce a negative sign).

 

Generally you don't remove variables from the PLS model unless you want to simplify the model by removing variables that are not good predictors. You could fit the model with all 20 variables, observe which ones have little effect, and then re-run the model without those variables that have little effect. Multicollinearity is not a consideration here.
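That two-pass screening might be sketched like this (the names `have`, `y`, `x1-x20`, and the retained subset are purely illustrative, as is NFAC=2):

```
/* Pass 1: fit all 20 predictors; SOLUTION prints the coefficients,
   from which weak contributors can be identified */
proc pls data=have method=pls nfac=2;
   model y = x1-x20 / solution;
run;

/* Pass 2: refit with only the stronger predictors
   (x3, x7, and x12 are stand-ins for whatever pass 1 suggests) */
proc pls data=have method=pls nfac=2;
   model y = x3 x7 x12 / solution;
run;
```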

--
Paige Miller
dkcundiffMD
Quartz | Level 8
Below is a comparison of (1) standard multiple regression analysis with the formatted composite 20-variable diet risk factor, physical activity, and sex; (2) partial least squares analysis with the formatted composite 20-variable diet risk factor, physical activity, and sex; and (3) PLS analysis with the 20 individual diet risk factor variables, physical activity, and sex.

 

libname IHME2017 "/folders/myfolders/IHME2017";
data source;
set IHME2017.source;
Where weighted_no le 7886;
label BMI17f1="20 foods combined risk factor";
BMI17f1 =
+ pmeat17KCsW * 5.49 * 0.3649
+ rmeat17KCsW * 50.70 * 0.4380
+ fish17KCsW * 10.01 * 0.0105
+ milk17KCsW * 25.37 * 0.4682
+ poultry16KCsW * 45.06 * 0.6611
+ eggs16KCsW * 19.47 * 0.4731
+ Alcohol17KCsW * 81.71 * 0.0277
+ Sugarb17KCsW * 297.65 * 0.0142
+ corn16KCsW * 34.67 * 0.0053
+ potatoes16KCsW * 84.16 * 0.0443
+ SFA16KCsW * 191.27 * 0.5038 * 0.477
+ PUFA17KCsW * 82.24 * 0.5431 * 0.477
+ TFA17KCsW * 13.40 * 0.2379 * 0.477
- fruits17KCsW * 40.39 * 0.3873
- Vegetables17KCsW * 80.14 * 0.270
- nutsseeds17KCsW * 8.51 * 0.2367
- wgrains17KCsW * 55.65 * 0.0402
- legumes17KCsW * 51.66 * 0.1477
- rice16KCsW * 141.23 * 0.3144
- swtpot16KCsW * 22.67 * 0.0230
;
run; quit;

*Standard multiple regression analysis with a composite dietary variable,
physical activity, and sex modeling body mass index worldwide;
Proc reg data=source;
model BMI17msW=BMI17f1 PAMets17msW sex_IDsW
/ selection=STEPWISE slentry=.25 slstay=.25;
run; quit;
* Results
BMI17f1       0.00545  0.00004933  3843.16145  12213.9  <.0001
PAMets17msW  -0.13763  0.00710      118.27826   375.90  <.0001
sex_IDsW      0.12293  0.00662      108.47074   344.73  <.0001
1  BMI17f1      foods combined risk factor  1  0.6446  0.6446  1023.48  14300.9  <.0001
2  PAMets17msW  Physical activity METs      2  0.0271  0.6717  346.730   650.39  <.0001
3  sex_IDsW     sex male 1 female 2         3  0.0138  0.6855   4.0000   344.73  <.0001
Accounts for 68.55% of the variance
*;

*Partial least squares with a composite dietary variable method by contrast;
PROC PLS data=source METHOD=PLS;
MODEL BMI17msW = BMI17f1 PAMets17msW sex_IDsW;
*OUTPUT OUT=PLSmethod;
run; quit;
*Result
The PLS Procedure

Data Set                        WORK.SOURCE
Factor Extraction Method        Partial Least Squares
PLS Algorithm                   NIPALS
Number of Response Variables    1
Number of Predictor Parameters  3
Missing Value Handling          Exclude
Number of Factors               3
Number of Observations Read     7886
Number of Observations Used     7886

Percent Variation Accounted for by Partial Least Squares Factors
Number of            Model Effects        Dependent Variables
Extracted Factors    Current   Total      Current   Total
1                    45.4455   45.4455    64.7483   64.7483
2                    27.6348   73.0804     3.4163   68.1646
3                    26.9196  100.0000     0.3819   68.5465
Accounts for 68.55% of the variance
;

*Partial least squares with 20 individual diet variables;
PROC PLS data=source METHOD=PLS;
MODEL BMI17msW = pmeat17KCsW rmeat17KCsW fish17KCsW milk17KCsW poultry16KCsW eggs16KCsW
Alcohol17KCsW Sugarb17KCsW corn16KCsW potatoes16KCsW SFA16KCsW PUFA17KCsW TFA17KCsW
fruits17KCsW Vegetables17KCsW nutsseeds17KCsW wgrains17KCsW legumes17KCsW rice16KCsW swtpot16KCsW
PAMets17msW sex_IDsW;
run; quit;
*The PLS Procedure with 20 individual dietary factors, physical activity and sex;
Percent Variation Accounted for by Partial Least Squares Factors
Number of            Model Effects        Dependent Variables
Extracted Factors    Current   Total      Current   Total
1                    32.7157   32.7157    77.2347   77.2347
2                     7.4541   40.1698     9.1991   86.4338
3                     6.6411   46.8109     1.6277   88.0615
4                     5.3212   52.1322     0.9785   89.0400
5                     5.4688   57.6010     0.4847   89.5247
6                     6.6591   64.2601     0.1593   89.6840
7                     2.9929   67.2530     0.1715   89.8554
8                     4.0856   71.3386     0.0753   89.9307
9                     4.9696   76.3082     0.0284   89.9591
10                    1.9025   78.2107     0.0252   89.9844
11                    4.2921   82.5028     0.0042   89.9886
12                    2.5422   85.0450     0.0047   89.9932
13                    2.5712   87.6162     0.0019   89.9951
14                    2.9205   90.5367     0.0007   89.9958
15                    2.6019   93.1386     0.0001   89.9959

The 15 extracted factors from the 22 variables account for 90% of the variance, but the flipped signs of the dietary variables make this nonsense.

Since PLS analysis gives us the same result (R2) as proc reg with our formatted 20 diet risk factor variable, is it reasonable for us to use proc reg this way and not worry about the VIF test used with the 20 individual dietary risk factors? Hopefully, you will save me from having to throw out variables, which I do not want to do.

I would be honored to list you as a contributor on our paper when we submit it to the Lancet. 

Many thanks. 

David Cundiff

 

 

dkcundiffMD
Quartz | Level 8

Sorry, I put my text rather than the code in the code box: 

libname IHME2017 "/folders/myfolders/IHME2017";

data source;
set IHME2017.source;
  Where weighted_no le 7886; 
label BMI17f1="20 foods combined risk factor";
  BMI17f1=
+	pmeat17KCsW	*	5.49	*	0.3649
+	rmeat17KCsW	*	50.70	*	0.4380
+	fish17KCsW	*	10.01	*	0.0105
+	milk17KCsW	*	25.37	*	0.4682
+	poultry16KCsW	*	45.06	*	0.6611
+	eggs16KCsW	*	19.47	*	0.4731
+	Alcohol17KCsW	*	81.71	*	0.0277
+	Sugarb17KCsW	*	297.65	*	0.0142
+	corn16KCsW	*	34.67	*	0.0053
+	potatoes16KCsW	*	84.16	*	0.0443
+	SFA16KCsW	*	191.27	*	0.5038 * 0.477
+	PUFA17KCsW 	*	82.24	*	0.5431 * 0.477
+	TFA17KCsW	*	13.40	*	0.2379 * 0.477
-	fruits17KCsW	*	40.39	*	0.3873
-	Vegetables17KCsW	*	80.14	*	0.270
-	nutsseeds17KCsW	*	8.51	*	0.2367
-	wgrains17KCsW	*	55.65	*	0.0402
-	legumes17KCsW	*	51.66	*	0.1477
-	rice16KCsW	*	141.23	*	0.3144
-	swtpot16KCsW	*	22.67	*	0.0230
;
run; quit;

Proc reg data=source;
	model BMI17msW=BMI17f1 PAMets17msW sex_IDsW
 	  	/ selection=STEPWISE slentry=.25 slstay=.25;
	run;quit;
	
* Results	
BMI17f1	0.00545	0.00004933	3843.16145	12213.9	<.0001
PAMets17msW	-0.13763	0.00710	118.27826	375.90	<.0001
sex_IDsW	0.12293	0.00662	108.47074	344.73	<.0001
1	BMI17f1	 	foods combined risk factor	1	0.6446	0.6446	1023.48	14300.9	<.0001
2	PAMets17msW	 	Physical activity METs	2	0.0271	0.6717	346.730	650.39	<.0001
3	sex_IDsW	 	sex male 1 female 2	3	0.0138	0.6855	4.0000	344.73	<.0001
*
;

*Partial least squares with a composite dietary variable method by contrast; 
PROC PLS data=source METHOD=PLS;
MODEL BMI17msW = BMI17f1 PAMets17msW sex_IDsW  ;
*OUTPUT OUT=PLSmethod;
run; quit;
*Result
The PLS Procedure

Data Set	WORK.SOURCE
Factor Extraction Method	Partial Least Squares
PLS Algorithm	NIPALS
Number of Response Variables	1
Number of Predictor Parameters	3
Missing Value Handling	Exclude
Number of Factors	3
Number of Observations Read	7886
Number of Observations Used	7886
The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors
Number of Extracted Factors	Model Effects	Dependent Variables
Current	Total	Current	Total
1	45.4455	45.4455	64.7483	64.7483
2	27.6348	73.0804	3.4163	68.1646
3	26.9196	100.0000	0.3819	68.5465
;

*Partial least squares with 20 individual diet variables;
PROC PLS data=source METHOD=PLS;
MODEL BMI17msW = pmeat17KCsW rmeat17KCsW fish17KCsW milk17KCsW poultry16KCsW eggs16KCsW
 Alcohol17KCsW Sugarb17KCsW corn16KCsW potatoes16KCsW SFA16KCsW PUFA17KCsW TFA17KCsW
fruits17KCsW Vegetables17KCsW nutsseeds17KCsW wgrains17KCsW	legumes17KCsW rice16KCsW swtpot16KCsW 
PAMets17msW sex_IDsW  ;
*OUTPUT OUT=PLSmethod;
run; quit;

*The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors
Number of Extracted Factors	Model Effects (Current)	Model Effects (Total)	Dependent Variables (Current)	Dependent Variables (Total)
1	32.7157	32.7157	77.2347	77.2347
2	7.4541	40.1698	9.1991	86.4338
3	6.6411	46.8109	1.6277	88.0615
4	5.3212	52.1322	0.9785	89.0400
5	5.4688	57.6010	0.4847	89.5247
6	6.6591	64.2601	0.1593	89.6840
7	2.9929	67.2530	0.1715	89.8554
8	4.0856	71.3386	0.0753	89.9307
9	4.9696	76.3082	0.0284	89.9591
10	1.9025	78.2107	0.0252	89.9844
11	4.2921	82.5028	0.0042	89.9886
12	2.5422	85.0450	0.0047	89.9932
13	2.5712	87.6162	0.0019	89.9951
14	2.9205	90.5367	0.0007	89.9958
15	2.6019	93.1386	0.0001	89.9959
;
PaigeMiller
Diamond | Level 26

Since the PLS analysis gives us the same result (R2) as PROC REG with our formatted 20-diet-risk-factor variable, is it reasonable for us to use PROC REG this way and not worry about the VIF test used with the 20 individual dietary risk factors?

 

No.

 

You haven't run PLS properly. If you leave out the NFAC= option, you get 15 factors, and the result comes very, very close to the regression results, with almost the same regression coefficients and almost the same r-squared as PROC REG. And without the NFAC= option, all of the multicollinearity problems are built into your results.

 

The PROC PLS analysis should (must!) have an NFAC= value specified, usually determined by cross-validation, to limit the number of factors extracted. By limiting the number of factors extracted by PLS, you move away from a situation where multicollinearity is a major problem and toward one where it has little impact. Another consequence of limiting the number of factors is that the model will not fit as well as a regression (lower r-squared), but the coefficients and predicted values will have lower mean squared error because the VIFs are much smaller. Thus the model is more stable and less prone to wild swings and wrong signs in the coefficients.
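A minimal sketch of what cross-validated factor selection might look like, assuming the data set and variable names from the earlier posts (CV= and CVTEST are documented PROC PLS options; the split count and seed here are arbitrary choices):

```sas
/* Let cross-validation choose the number of factors: CV=SPLIT(7)
   holds out every 7th observation in turn, and CVTEST applies
   van der Voet's test to select the fewest factors whose predictive
   ability is not significantly worse than the optimum. */
proc pls data=source method=pls cv=split(7) cvtest(seed=12345);
   model BMI17msW = pmeat17KCsW rmeat17KCsW fish17KCsW milk17KCsW
                    poultry16KCsW eggs16KCsW Alcohol17KCsW Sugarb17KCsW
                    corn16KCsW potatoes16KCsW SFA16KCsW PUFA17KCsW
                    TFA17KCsW fruits17KCsW Vegetables17KCsW
                    nutsseeds17KCsW wgrains17KCsW legumes17KCsW
                    rice16KCsW swtpot16KCsW PAMets17msW sex_IDsW;
run; quit;
```

The number of factors it selects can then be fixed with NFAC= in a final fit.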

 

One thing you can see in your output is that the first two factors explain 86% of the variability of Y, while if you keep adding factors you approach the regression result, where 90% of the Y variability is explained. That is the tradeoff: an extra 4% of r-squared, but regression coefficients heavily impacted by multicollinearity (huge VIFs) and predicted values with high mean squared error; or give up that 4% and get regression coefficients with much lower VIFs and predicted values with much lower mean squared error.

 

I gave you a link to the documentation for PROC PLS where examples and correct syntax are explained in detail, and proper use of PLS is illustrated.

--
Paige Miller
dkcundiffMD
Quartz | Level 8

I used NFAC=5, which throws out 19/25 variables, and got the following result:

 

The PLS Procedure
Data Set	WORK.SOURCE
Factor Extraction Method	Partial Least Squares
PLS Algorithm	NIPALS
Number of Response Variables	1
Number of Predictor Parameters	22
Missing Value Handling	Exclude
Number of Factors	5
Number of Observations Read	7886
Number of Observations Used	7886
The PLS Procedure
Percent Variation Accounted for by Partial Least Squares Factors
Number of Extracted Factors	Model Effects (Current)	Model Effects (Total)	Dependent Variables (Current)	Dependent Variables (Total)
1	32.7157	32.7157	77.2347	77.2347
2	7.4541	40.1698	9.1991	86.4338
3	6.6411	46.8109	1.6277	88.0615
4	5.3212	52.1322	0.9785	89.0400
5	5.4688	57.6010	0.4847	89.5247

I read the Tobias paper and understood the gist but not all the details. I’m not a statistician. I am a retired internal medicine physician who got my undergraduate degree in mathematics long before SAS was a company and when we all carried slide rules.

The Tobias abstract begins, “Partial least squares is a popular method for soft modelling in industrial applications.” The referenced uses of PLS included econometrics, neural networks, chemometrics, and social science. A 2013 paper titled “Evaluation of methodologies for assessing the overall diet: dietary quality scores and dietary pattern analysis” discussed three main approaches to studying the overall diet: (1) dietary guidelines based, (2) principal component analysis or cluster analysis, and (3) reduced rank regression (https://pubmed.ncbi.nlm.nih.gov/23360896/). PLS was not mentioned. However, a 2017 paper titled “An application of partial least-squares for identifying dietary patterns in bone health” did use PLS. They introduced PLS saying, “Partial least-squares (PLS) is a data-reduction technique for identifying dietary patterns that maximizes correlation between foods and nutrients hypothesized to be on the path to disease, is more hypothesis-driven than previous methods, and has not been applied to the study of dietary patterns in relation to bone health.” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5506508/#:~:text=Partial%20least%2Dsquares%20(PLS),patt.... We are not researching dietary patterns that maximize correlations between foods and nutrients. We are modeling the dietary and other risk factors for BMI worldwide.

The Tobias paper introduction included, “In such so-called soft science applications, the researcher is faced with many variables and ill-understood relationships, and the object is merely to construct a good predictive model.” Tobias goes on to say that a good predictive model is usually defined by cross-validation. The BMI formula that we modeled with 24 total risk factors, including 20 dietary risk factors used as a composite variable, was tested by cross-validation with 20 random trials of 100 cohorts each (out of a database with 7886 cohorts), and in each of these cross-validation trials no signs were flipped and the cross-validating BMI formulas were all very close to the initial BMI formula (see Table 1).
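The 20-trial cross-validation scheme described above could be scripted along these lines; a hedged sketch, assuming the source data set from the earlier code (the seed is arbitrary, and only one of the 20 draws is shown):

```sas
/* One cross-validation draw: a simple random sample of 100 cohorts,
   then a refit of the three-predictor model. Repeating with 20
   different SEED= values gives 20 such trials. */
proc surveyselect data=source out=cv_draw method=srs n=100 seed=1;
run;

proc reg data=cv_draw;
   model BMI17msW = BMI17f1 PAMets17msW sex_IDsW;
run; quit;
```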

Table 1. Cross-validation analysis: BMI formulas with 20 groups each of 100 random cohorts (n=7886 cohorts, representing about 7.8 billion people)

BMI risk factor	Worldwide BMI formula mean percent weight	20-group mean percent weight	SD	Minimum	Maximum
+ Processed meat kcal/day	0.266	0.246	0.093	0.050	0.472
+ Red meat kcal/day	2.951	2.542	0.947	0.734	4.762
+ Fish kcal/day	0.014	0.064	0.152	0.004	0.610
+ Milk kcal/day	1.578	1.362	0.456	0.404	2.327
+ Poultry kcal/day	3.958	3.364	1.023	1.181	5.691
+ Eggs kcal/day	1.224	1.084	0.353	0.486	1.912
+ Alcohol kcal/day	0.301	0.279	0.307	0.001	0.969
+ Sugary beverages kcal/day	0.562	0.901	0.886	0.000	3.247
+ Corn kcal/day	0.024	0.072	0.067	0.000	0.227
+ Potatoes kcal/day	0.495	0.396	0.366	0.001	1.366
+ Saturated fatty acids kcal/day	6.107	5.203	1.892	1.986	8.805
+ Polyunsaturated fatty acids	2.831	2.423	0.832	0.916	4.503
+ Trans fatty acids kcal/day	0.202	0.204	0.082	0.063	0.354
- Fruits kcal/day	2.078	1.899	0.456	1.248	2.925
- Vegetables kcal/day	2.875	2.596	0.596	1.222	3.530
- Nuts and seeds kcal/day	0.268	0.234	0.093	0.035	0.434
- Whole grains kcal/day	0.297	0.263	0.206	0.000	0.679
- Legumes kcal/day	1.014	0.880	0.366	0.325	1.718
- Rice kcal/day	5.900	5.091	2.056	1.769	9.041
- Sweet potatoes kcal/day	0.069	0.074	0.050	0.006	0.179
- Physical activity	11.797	12.503	2.671	7.660	17.190
- Child severe underweight	19.850	20.781	3.580	15.267	28.867
+ Discontinued breast feeding	13.090	13.679	2.084	9.525	16.933
+ Sex (1=male, 2=female)	4.479	5.143	2.148	1.053	8.642
Total percent weights	82.230	81.280	3.550	75.240	86.930

The Tobias paper discussion began, “As discussed in the introductory section, soft science applications involve so many variables that it is not practical to seek a ‘‘hard’’ model explicitly relating them all.” Well, we claim that our 24 variable and soon to be 25 variable hard science formula modeling BMI gives excellent predictability. The resultant BMI formula performs equally well with the other eight Bradford Hill causality criteria—the “gold standard” criteria for providing proof in epidemiology.

 

Dr. Miller, can you please agree that the hard science Proc reg is more suitable than the soft science Proc PLS for this worldwide BMI modeling application and tell my statistician co-author that we shouldn’t have to throw out any variables?

 

Using this population-weighted and formatted database of Global Burden of Disease data from the Institute of Health Metrics and Evaluation, my co-author and I serve as volunteer collaborators modeling risk factors related to health outcomes. We would be happy to help SAS students model any of over 50 other health outcomes for which we have the formatted data (e.g., systolic blood pressure, cardiovascular diseases, cancers, etc.). I am old. Fifty more papers like this would be too much for me to accomplish in this lifetime. The only requirement would be to register as a volunteer collaborator with the Institute of Health Metrics and Evaluation. By the way, this Bill and Melinda Gates-funded institute hires many statisticians (http://www.healthdata.org/).

 

Many thanks, Dr. Miller, for your help.

David Cundiff



PaigeMiller
Diamond | Level 26

I used NFAC=5, which throws out 19/25 variables

PLS does not throw out variables. Five factors indicates that five new factors/dimensions (these are different words for the same thing) are computed and used in the modeling. All variables contribute to the fitted model, some more than others, according to what the data is saying, but nothing is thrown out. The final regression equation from PROC PLS will use all 25 variables.
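To see that every predictor still carries a coefficient in the reduced-factor model, the SOLUTION option prints the PLS parameter estimates and the OUTPUT statement writes predicted values; a minimal sketch using the variable names from the earlier posts:

```sas
/* With NFAC=5, PLS extracts 5 factors, but the solved regression
   equation still assigns a coefficient to every predictor:
   SOLUTION prints those coefficients, OUTPUT saves predictions. */
proc pls data=source method=pls nfac=5;
   model BMI17msW = pmeat17KCsW rmeat17KCsW fish17KCsW milk17KCsW
                    poultry16KCsW eggs16KCsW Alcohol17KCsW Sugarb17KCsW
                    corn16KCsW potatoes16KCsW SFA16KCsW PUFA17KCsW
                    TFA17KCsW fruits17KCsW Vegetables17KCsW
                    nutsseeds17KCsW wgrains17KCsW legumes17KCsW
                    rice16KCsW swtpot16KCsW PAMets17msW sex_IDsW
                    / solution;
   output out=pls_pred predicted=BMIhat;
run; quit;
```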

 

The Tobias paper discussion began, “As discussed in the introductory section, soft science applications involve so many variables that it is not practical to seek a ‘‘hard’’ model explicitly relating them all.” Well, we claim that our 24 variable and soon to be 25 variable hard science formula modeling BMI gives excellent predictability. The resultant BMI formula performs equally well with the other eight Bradford Hill causality criteria—the “gold standard” criteria for providing proof in epidemiology.

 

Dr. Miller, can you please agree that the hard science Proc reg is more suitable than the soft science Proc PLS for this worldwide BMI modeling application and tell my statistician co-author that we shouldn’t have to throw out any variables?

 

PROC REG can be considered "more suitable" if you are willing to accept the effects of collinearity on your regression coefficients. But it is an empirical approach: it uses the data you provide to determine the best-fitting regression equation, without regard for the known and previously determined (by others) BMI model. And so, because either your data is different or the collinearity causes sufficient problems, you can get (and apparently do get, based on your earlier statements) coefficients with the wrong sign and coefficients so variable due to collinearity that they may be far from the theoretical value. Maybe you want a mixed "empirical-hard model" model, but I have no idea how to get that, and I'm not even sure such a thing exists. (So I don't agree PROC REG is appropriate; it has the problems mentioned in this paragraph.)

 

PLS is also empirical: it takes the data you provide and determines a predictive model, using a different algorithm than linear regression, and so it produces a different predictive model, without regard for the known and previously determined (by others) BMI model. But there's no getting around the fact that it is empirical. The benefit, as stated many times now, is that it is robust to multicollinearity and the regression coefficients will have the right sign (right sign based on the data, not on the BMI model, which PLS doesn't use) and low variability, but they are biased. Maybe you want a mixed "empirical-hard model" model, but I have no idea how to get that, and I'm not even sure such a thing exists.

 

Which brings us back to the very first question that I should have asked: what is the goal of this modeling? Is it to fit the data? Is it to confirm the BMI model holds on this data? Is it something else? When someone asks about VIFs and regression modeling, I assume they are talking about empirical modeling and the goal of the modeling is to find a predictive model that fits the data, but now it sounds like that is not the goal.

 

--
Paige Miller
dkcundiffMD
Quartz | Level 8

Thanks, Dr. Miller, for these comments. 

 

“PLS does not throw out variables. Five factors indicates that five new factors/dimensions (these are different words for the same thing) are computed and used in the modeling. All variables contribute to the fitted model, some more than others, according to what the data is saying, but nothing is thrown out. The final regression equation from PROC PLS will use all 25 variables.”

 

What is critical to our methodology, and why it makes sense out of multicollinearity chaos, is that, in the 20-food composite variable, each individual food is multiplied by the kilocalories/day consumed on average worldwide and also by the R2 of its correlation with BMI. PROC PLS did fine with the three-independent-variable multiple regression of the (1) composite dietary variable, (2) physical activity, and (3) sex (variance accounted for by the BMI formula = 68.55%, the same as with PROC REG). But without that initial food variable formatting, PROC PLS was not helpful with the 20 individual foods. In addition, PROC REG allows us to take the SAS results and do the necessary calculations in Excel to create the final BMI formula, with the coefficient of each dietary and other variable expressed as a percent weight (sometimes termed a “population attributable fraction”), totaling to the BMI formula percent weight. Creating the final 25-risk-factor BMI formula this way and harmonizing it with worldwide BMI by equating their SDs and mean values allows much more. We can test the functionality of the formula with the nine Bradford Hill causality criteria (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1898525/) and test different risk factor scenarios based on BMI formula estimates, as shown in Table 2.

 

 

Table 2. Testing common dietary and other risk factor scenarios with BMI formula estimates

Dietary and other risk factor scenario	BMI kg/M2	BMI formula estimate β
World (n=7886 cohorts, mean 1990-2017)	21.79	21.79
World with breast feeding discontinued < 6 months in ALL	n/a	22.00
World with breast feeding discontinued < 6 months in NONE	n/a	21.76
World with ALL children with severe underweight	n/a	20.68
World with NO children with severe underweight	n/a	22.04
United Kingdom (n=66 cohorts), UK best-fit BMI formula	24.99	25.45
USA (n=376 cohorts, 2004, the mean of 1990 and 2017)	26.66	27.27
Following USA Dietary Guidelines 2015-2020 standard recommendations	n/a	23.43
Following USA Dietary Guidelines 2015-2020 Mediterranean diet recommendations	n/a	22.57
Following USA Dietary Guidelines 2015-2020 vegetarian diet recommendations	n/a	21.52
USA with 25% reduction of BMI-increasing food intake†	n/a	23.31
USA with 50% reduction of BMI-increasing food intake†	n/a	19.35
USA mean physical activity plus 1 hour/day running at 6 mph	n/a	26.31
USA mean physical activity plus 1 hour/day running at 6 mph and 25% reduction of BMI-increasing food intake†	n/a	22.35
USA with no red or processed meat†	n/a	24.44
USA with no sugary beverage intake†	n/a	26.44
USA vegetarian (no meat, poultry, fish)†	n/a	21.42
USA vegan (no meat, poultry, fish, dairy, or eggs)†	n/a	19.56
EAT-Lancet diet†	n/a	22.88
Low Carbohydrate Mediterranean Diet†	n/a	32.53

β BMI formula estimates based on 28 years of following dietary and risk factor patterns

† Kcal/day of the 13 BMI-increasing foods isocalorically shifted to the 7 BMI-decreasing foods in the BMI formula, distributed equally.

 

 

“PROC REG can be considered "more suitable" if you are willing to accept the effects of collinearity on your regression coefficients.”

 

I am more than willing to accept the effects of collinearity on my regression coefficients, because modeling dietary risk factors isn't like modeling some engineering or chemometric application where the coefficients have to be exactly right. As shown in Table 1 (several posts ago) with the 20 cross-validation trials, each coefficient has a fairly wide range of experimentally valid values.

 

"But it is an empirical approach: it uses the data you provide to determine the best-fitting regression equation, without regard for the known and previously determined (by others) BMI model."

 

There is no BMI model known and previously determined by others. The recently published Institute of Health Metrics and Evaluation (IHME) Global Burden of Disease risk factor paper said, “At the global level, we find that high BMI is rising considerably faster than low physical activity and poor diet quality. ... Some studies suggest that certain diet components are more likely to contribute to increased BMI than others; the mechanism of these effects can be complex and include effects on appetite, absorption, and displacement of other foods.35 It is currently hard to understand the role of physical inactivity, excess caloric intake, and diet quality in driving the increase in BMI.” (https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30752-2/fulltext).

IHME, the place that sent me, as a registered volunteer collaborator, 1.4 gigabytes of raw data on BMI and over 30 risk factors, does not use their own data to model BMI or other health outcomes. Instead, they do systematic literature reviews to draw conclusions about causality of health outcomes.

 

"And so, because either your data is different or the collinearity causes sufficient problems, you can get (and apparently do get, based on your earlier statements) coefficients with the wrong sign and coefficients so variable due to collinearity that they may be far from the theoretical value. Maybe you want a mixed "empirical-hard model" model, but I have no idea how to get that, and I'm not even sure such a thing exists. (So I don't agree PROC REG is appropriate; it has the problems mentioned in this paragraph.)”

 

For the sake of argument, consider that we may have developed at least a prototype of an “empirical-hard model” model. But how would you be able to tell a good empirical-hard model from a poor one? Consider vetting the model with the classical Bradford Hill causality criteria used in epidemiology: (1) strength, (2) experiment (e.g., cross-validation), (3) consistency, (4) biologic gradient (i.e., dose response), (5) temporality, (6) analogy, (7) plausibility, (8) specificity, and (9) coherence. Our empirical methodology aced all nine of these criteria. You can read a previous draft of our BMI formula paper to find our errors: https://www.medrxiv.org/content/10.1101/2020.07.27.20162487v1. As soon as my statistician co-author agrees to include all 25 risk factors and completes the cross-validation step, we will update the preprint, which will be at the same link, hopefully within a week or two. Preprints are not peer reviewed. It would help us greatly in getting the paper into external peer review at the Lancet if you and/or other SAS experts or aspiring experts would peer review the paper and post your reviews as comments.

 

Which brings us back to the very first question that I should have asked: what is the goal of this modeling? Is it to fit the data? Is it to confirm the BMI model holds on this data? Is it something else?

 

The goal of the modeling is to win the Nobel Prize in medicine. Kidding. All of the above—we need to fit the data, confirm the BMI model holds on this data, show that this new empirical methodology works for BMI and potentially for hundreds of other noncommunicable disease health outcomes, and get the paper published in a consequential journal like the Lancet.

This modeling also has a potential practical public health application: creating an app based on the BMI formula to provide feedback to individuals wanting to control their weight by diet and exercise. My website has a proof-of-concept prototype of such an app called “Future body mass index (BMI) estimator based on diet and exercise.” This app is primitive compared to using IHME GBD data. I created BMI formulas from each of two databases (Diabetes Control and Complications Trial and World Health Organization/Food and Agriculture data) and merged the results of BMI modeling with these two databases into the framework for the app. You input your dietary pattern, exercise level, age, sex, height, and weight, and it reports your estimated BMI in 1, 5, 10, and 20 years. You can then change components of your diet and/or exercise data to see what options would lead you to reach your weight control goal.

 

When someone asks about VIFs and regression modeling, I assume they are talking about empirical modeling and the goal of the modeling is to find a predictive model that fits the data, but now it sounds like that is not the goal.

 

Our goal only starts with finding a predictive model that fits the data. We have done that. It then extends up to and including SAS students in college and elsewhere collaborating with us to model many more health outcomes with this methodology. These GBD data cover the years 1990-2017 and will become outdated when the 1990-2019 data (with everything updated) become available to GBD researchers. Eventually, more preprints of GBD data modeling health outcomes will lead IHME to realize that statistically modeling health outcomes with their own data should complement their laser focus on systematic literature reviews for drawing conclusions that lead to public health policy strategies.

Spread the word!

Many thanks, Dr. Miller.

 
