Thanks, Dr. Miller, for these comments.
“PLS does not throw out variables. Five factors indicates that five new factors/dimensions (these are different words for the same thing) are computed and used in the modeling. All variables contribute to the fitted model, some more than others, according to what the data is saying, but nothing is thrown out. The final regression equation from PROC PLS will use all 25 variables.”
What is critical to our methodology, and why it makes sense out of multicollinearity chaos, is that in the 20-food composite variable each individual food is multiplied by its worldwide mean consumption in kilocalories/day and also by the R² of its correlation with BMI. PROC PLS did fine with the three-independent-variable multiple regression of (1) the composite dietary variable, (2) physical activity, and (3) sex (variance accounted for by the BMI formula = 68.55%, the same as with PROC REG). But without that initial food variable formatting, PROC PLS was not helpful with the 20 individual foods. In addition, PROC REG allows us to take the SAS results into Excel and do the calculations needed to create the final BMI formula, with the coefficient of each dietary and other variable expressed as a percent weight (sometimes termed “population attributable fractions”) and the weights summing to the BMI formula's total percent weight. Creating the final 25-risk-factor BMI formula this way, and harmonizing it with worldwide BMI by equating their SDs and mean values, allows much more: we can test the functionality of the formula with the nine Bradford Hill causality criteria (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1898525/) and test different risk factor scenarios based on BMI formula estimates, as shown in Table 2.
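For readers who want to see the arithmetic, here is a minimal Python sketch of the two steps described above: weighting each food by its worldwide mean kcal/day and its R² with BMI, then rescaling the resulting scores so their mean and SD match a target. The function names, the dictionary layout, and the explicit +1/-1 direction for BMI-increasing versus BMI-decreasing foods are assumptions made for illustration; the actual work was done with PROC REG and Excel, and this is not that workflow.

```python
from statistics import mean, pstdev


def composite_diet_score(intakes, mean_kcal, r2, direction):
    """Weighted sum of food intakes for one cohort.

    intakes   -- food -> intake for this cohort (hypothetical units)
    mean_kcal -- food -> worldwide mean kcal/day for that food
    r2        -- food -> R^2 of that food's correlation with BMI
    direction -- food -> +1 or -1 (ASSUMPTION: sign of the correlation)
    """
    return sum(
        direction[food] * intakes[food] * mean_kcal[food] * r2[food]
        for food in intakes
    )


def harmonize(raw_scores, target_mean, target_sd):
    """Linearly rescale scores so their mean and SD equal the targets."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)  # population SD; assumes the scores are not all equal
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


# Illustrative call with made-up numbers (the target SD of 2.0 is hypothetical).
scaled = harmonize([1.0, 2.0, 3.0], target_mean=21.79, target_sd=2.0)
```

After `harmonize`, the rescaled scores have exactly the target mean and SD, which is what "equating their SDs and mean values" amounts to in this reading.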
Table 2. Testing common dietary and other risk factor scenarios with BMI formula estimates

| Dietary and other risk factor scenarios | BMI (kg/m²) | BMI formula estimate β |
|---|---|---|
| World (n=7886 cohorts, mean 1990-2017) | 21.79 | 21.79 |
| World with breast feeding discontinued < 6 months in ALL | | 22.00 |
| World with breast feeding discontinued < 6 months in NONE | | 21.76 |
| World with ALL children with severe underweight | | 20.68 |
| World with NO children with severe underweight | | 22.04 |
| United Kingdom (n=66 cohorts), UK best fit BMI formula | 24.99 | 25.45 |
| USA (n=376 cohorts, 2004, the mean of 1990 and 2017) | 26.66 | 27.27 |
| Following USA Dietary Guidelines 2015-2020 standard recommendations | | 23.43 |
| Following USA Dietary Guidelines 2015-2020 Mediterranean diet recommendations | | 22.57 |
| Following USA Dietary Guidelines 2015-2020 vegetarian diet recommendations | | 21.52 |
| USA with 25% reduction of BMI-increasing food intake† | | 23.31 |
| USA with 50% reduction of BMI-increasing food intake† | | 19.35 |
| USA mean physical activity plus 1 hour/day running at 6 mph | | 26.31 |
| USA mean physical activity plus 1 hour/day running at 6 mph and 25% reduction of BMI-increasing food intake† | | 22.35 |
| USA with no red or processed meat† | | 24.44 |
| USA with no sugary beverage intake† | | 26.44 |
| USA vegetarian (no meat, poultry, fish)† | | 21.42 |
| USA vegan (no meat, poultry, fish, dairy or eggs)† | | 19.56 |
| EAT-Lancet diet† | | 22.88 |
| Low Carbohydrate Mediterranean Diet† | | 32.53 |

β BMI formula estimates based on 28 years of following dietary and risk factor patterns.
† Kcal/day from the 13 BMI-increasing foods isocalorically shifted to the 7 BMI-decreasing foods in the BMI formula, distributed equally.
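The isocaloric shift in the † footnote can be sketched in a few lines of Python. This is one reading of the footnote, not the spreadsheet used for Table 2: a chosen fraction of the kcal/day is removed from each BMI-increasing food, and the total removed is split equally among the BMI-decreasing foods so that total kcal/day is unchanged. The function name and the four food names are hypothetical.

```python
def isocaloric_shift(kcal, increasing, decreasing, fraction):
    """Move `fraction` of the kcal/day from the BMI-increasing foods to the
    BMI-decreasing foods, split equally, keeping total kcal/day fixed."""
    shifted = dict(kcal)
    moved = 0.0
    for food in increasing:
        cut = kcal[food] * fraction
        shifted[food] -= cut
        moved += cut
    share = moved / len(decreasing)  # equal split across decreasing foods
    for food in decreasing:
        shifted[food] += share
    return shifted


# Made-up intakes in kcal/day for four hypothetical foods.
baseline = {
    "sugary_beverages": 200.0,
    "processed_meat": 100.0,
    "fruits": 50.0,
    "vegetables": 50.0,
}
after = isocaloric_shift(
    baseline,
    increasing=["sugary_beverages", "processed_meat"],
    decreasing=["fruits", "vegetables"],
    fraction=0.25,
)
```

With these made-up numbers, 75 kcal/day leave the two BMI-increasing foods and 37.5 kcal/day are added to each BMI-decreasing food, so the total stays at 400 kcal/day.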
“PROC REG can be considered "more suitable" if you are willing to accept the affects of collinearity on your regression coefficients.”
I am more than willing to accept the effects of collinearity on my regression coefficients because modeling dietary risk factors isn’t like modeling an engineering or chemometric application where the coefficients have to be exactly right. As shown in Table 1 (several posts ago) with the 20 cross validation trials, each coefficient has a fairly wide range of experimentally valid values.
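To illustrate what "a range of values across cross validation trials" means, here is a small Python sketch that refits a one-predictor least-squares line on repeated random halves of the data and reports the smallest and largest slope. The random half-split design, the single predictor, and the function names are assumptions for illustration only; they are not the procedure behind Table 1.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for one predictor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_range(xs, ys, trials=20, seed=0):
    """Refit on `trials` random halves of the data; return (min, max) slope."""
    rng = random.Random(seed)  # seeded so the trials are reproducible
    indices = list(range(len(xs)))
    slopes = []
    for _ in range(trials):
        sample = rng.sample(indices, len(indices) // 2)
        slopes.append(
            ols_slope([xs[i] for i in sample], [ys[i] for i in sample])
        )
    return min(slopes), max(slopes)
```

The wider the interval returned by `slope_range`, the less any single fitted coefficient should be read as exact.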
"But it is an empirical approach, it uses the data you provide to determine what the best fitting regression equation is, without regard for the known and previously determined (by others) BMI model."
There is no known BMI model previously determined by others. The recently published Institute for Health Metrics and Evaluation (IHME) Global Burden of Disease risk factor paper said, “At the global level, we find that high BMI is rising considerably faster than low physical activity and poor diet quality. ... Some studies suggest that certain diet components are more likely to contribute to increased BMI than others; the mechanism of these effects can be complex and include effects on appetite, absorption, and displacement of other foods. It is currently hard to understand the role of physical inactivity, excess caloric intake, and diet quality in driving the increase in BMI.” (https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30752-2/fulltext).
IHME, which sent me, as a registered volunteer collaborator, 1.4 gigabytes of raw data on BMI and over 30 risk factors, does not use its own data to model BMI or other health outcomes. Instead, it conducts systematic literature reviews to draw conclusions about the causality of health outcomes.
"And so because either your data is different, or the collinearity causes sufficient problems, that you can get (and apparently do get, based on your earlier statements) coefficients with the wrong sign and coefficients that are so variable due to collinearity that they may be far away from the theoretical value. Maybe you want a mixed "empirical-hard model" model, but I have no idea how to get that, and I'm not even sure such a thing exists. (So I don't agree PROC REG is appropriate, it has the problems mentioned in this paragraph)”
For the sake of argument, consider that we may have developed at least a prototype of an “empirical-hard model” model. But how would you be able to tell a good empirical-hard model from a poor one? Consider vetting the model with the classical Bradford Hill causality criteria used in epidemiology: (1) strength, (2) experiment (e.g., cross validation), (3) consistency, (4) biologic gradient (i.e., dose response), (5) temporality, (6) analogy, (7) plausibility, (8) specificity, and (9) coherence. Our empirical methodology aced all nine of these criteria. You can read a previous draft of our BMI formula paper to find our errors: https://www.medrxiv.org/content/10.1101/2020.07.27.20162487v1. As soon as my statistician co-author agrees to include all 25 risk factors and completes the cross validation step, we will update the preprint, which will be at the same link, hopefully within a week or two. Preprints are not peer reviewed. It would help us greatly in getting the paper into external peer review at the Lancet if you and/or other SAS experts or aspiring experts would review the paper and post your reviews as comments.
“Which brings us back to the very first question that I should have asked: what is the goal of this modeling? Is it to fit the data? Is it to confirm the BMI model holds on this data? Is it something else?”
The goal of the modeling is to win the Nobel Prize in medicine. Kidding. All of the above—we need to fit the data, confirm the BMI model holds on this data, show that this new empirical methodology works for BMI and potentially for hundreds of other noncommunicable disease health outcomes, and get the paper published in a consequential journal like the Lancet.
This modeling also has a potential practical public health application of creating an app based on the BMI formula to provide feedback to individuals wanting to control their weight by diet and exercise. My website has a proof of concept prototype of such an app called, “Future body mass index (BMI) estimator based on diet and exercise.” This app is primitive compared to using IHME GBD data. I created BMI formulas from each of two databases (Diabetes Control and Complications Trial and World Health Organization/Food and Agriculture data) and merged the results of BMI modeling with these two databases into the framework for the app. You input your dietary pattern, exercise level, age, sex, height, and weight and it reports your estimated BMI in 1, 5, 10, and 20 years. You can then change components of your diet and/or exercise data to see what options would lead you to reach your weight control goal.
“When someone asks about VIFs and regression modeling, I assume they are talking about empirical modeling and the goal of the modeling is to find a predictive model that fits the data, but now it sounds like that is not the goal.”
Our goal only starts with finding a predictive model that fits the data. We have done that. It then extends up to and including SAS students in college and elsewhere collaborating with us to model many more health outcomes with this methodology. These GBD data cover the years 1990-2017 and will become outdated when the 1990-2019 data (with everything updated) become available to GBD researchers. Eventually, more preprints of GBD data modeling health outcomes will lead IHME to realize that using its data for statistically modeling health outcomes should complement its razor-sharp focus on systematic literature reviews to draw conclusions leading to public health policy strategies.
Spread the word!
Many thanks, Dr. Miller.