Quantile Regression: Beyond Modeling Conditional Means

We often compute quantiles such as quartiles, deciles, or percentiles to describe the data at hand. Although quantiles are commonly used for data summarization, quantile regression is not frequently used as a statistical modeling technique. Whereas ordinary least squares (OLS) regression models the relationship between the covariates and the conditional mean of the response variable, quantile regression extends the regression model to conditional quantiles of the response variable. If the assumptions of the OLS model held exactly (in particular, independent errors with constant variance), quantile regression would be largely unnecessary: the slope coefficients would be the same at every quantile, and only the intercept would shift from one quantile to the next. In that case, an OLS model and its associated dispersion measures would adequately capture the key characteristics of the data.
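To see why, consider a short sketch in notation that is assumed here and is not part of the original article: $x$ is the covariate vector, $\beta$ the regression coefficients, and $F_\varepsilon$ the common distribution of the errors. Under a pure location-shift model, the conditional quantile function differs from the linear predictor only by a constant:

```latex
y = x^\top \beta + \varepsilon, \qquad \varepsilon \sim F_\varepsilon \ \text{independent of } x
\;\Longrightarrow\;
Q_\tau(y \mid x) = x^\top \beta + F_\varepsilon^{-1}(\tau), \qquad 0 < \tau < 1 .
```

Only the term $F_\varepsilon^{-1}(\tau)$ depends on $\tau$, so it is absorbed into the intercept. When the spread of the errors depends on $x$, the slopes themselves change with $\tau$, which is the situation quantile regression is designed for.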

Quantile Regression Model

Quantile regression uses a linear model to fit the quantiles of a response variable conditional on the explanatory variables. The model does not assume a particular parametric distribution for the response.


The τth sample quantile can be obtained not only by ordering the sample observations but also as the solution to an optimization problem, and it is this formulation that extends to regression. Quantile regression cannot be performed by merely dividing the unconditional distribution of the response variable into subsets and applying least squares fits to each subset. Instead, quantile regression uses all available data to fit each quantile. The parameter estimates are obtained by replacing the squared-error loss of least squares with an asymmetrically weighted absolute loss, known as the check loss. For median (50th percentile) regression, this criterion reduces to minimizing the sum of the absolute residuals.
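The optimization problem can be written out as follows. The symbols are assumptions introduced for illustration: $y_i$ and $x_i$ are the response and covariate vector of observation $i$, $n$ is the number of observations, and $\rho_\tau$ denotes the check loss at quantile level $\tau$.

```latex
\rho_\tau(r) = r\,\bigl(\tau - \mathbf{1}\{r < 0\}\bigr)
= \begin{cases} \tau\, r, & r \ge 0, \\ (\tau - 1)\, r, & r < 0, \end{cases}
\qquad
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\bigl(y_i - x_i^\top \beta\bigr).
```

At $\tau = 0.5$ the check loss is $\rho_{0.5}(r) = |r|/2$, so the minimizer is the least absolute deviations (median regression) fit. For $\tau < 0.5$, negative residuals are penalized more heavily than positive ones, which pulls the fitted line toward the lower part of the response distribution.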

Fitting Quantile Regression Models using SAS Viya

Next, I will demonstrate how to fit a quantile regression model using SAS Viya. For the purposes of this demonstration, I will use the Infant Birth Weight data set (SASHELP.BWEIGHT), which you can access from the SASHELP library. It comes from a public health study that examined the association of several covariates with birth weight. A random sample was taken of live, singleton births to mothers in the United States who were recorded as Black or White and who were between the ages of 18 and 45.

Using this data set, I will predict birth weight, because infants born at low weights are significantly more prone to health complications than those born at more typical weights. Because the response variable is continuous, my first thought would be to model mean birth weight as a function of the covariates using OLS regression. But modeling the mean alone may not be adequate because, as you will see, different factors are important at different quantiles. The focus of such a study lies in predicting which mothers are likely to have the lowest-weight babies, not the average birth weight for a particular group of mothers. Hence, I fit a quantile regression model instead of a linear regression model.

I will use Model Studio for this analysis, and I assume that you are already familiar with the steps required to create a pipeline in Model Studio. If not, check out "Build Models with SAS Model Studio | SAS Viya Quick Start Tutorial".
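Before modeling, it can help to look at the unconditional distribution of the response. The following is a minimal sketch, assuming a SAS session in which SASHELP.BWEIGHT is available and in which the response variable is named Weight. It is not part of the Model Studio pipeline.

```sas
/* Summarize birth weight: count, mean, and the 5th, 50th, and 95th percentiles. */
/* Assumes the response variable in SASHELP.BWEIGHT is named Weight.            */
proc means data=sashelp.bweight n mean p5 median p95 maxdec=1;
   var Weight;
run;
```

These are unconditional summaries of the whole sample. Quantile regression goes further by modeling how such percentiles change with the covariates.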

I start with a blank template, that is, a pipeline that has only a Data node, and add a Quantile Regression node. To do so, right-click the Data node and select Add child node -> Supervised Learning -> Quantile Regression. In the Quantile Regression node properties, the default value of 0.5 for the Quantile option means that the node models the median by default; however, you can specify any quantile level between 0 and 1. Also, change the selection method from the default, Stepwise, to Backward (Selection method -> Backward). Then, under Selection options, set Model-selection criterion -> Average check loss validation. All other settings are kept at their defaults. The pipeline should resemble the following.

Next, I click Run Pipeline to run the entire pipeline. When the run finishes, right-click the Quantile Regression node and select Results. On the Node tab, I expand Output to see the results from the QTRSELECT procedure. The output displays basic tables such as Model Information, Selection Information, and Class Information. The Selection Information table displays the selection method and the select, stop, and choose criteria that correspond to the settings made in the node’s options pane. Scroll further down and examine the Selection Details table. You can see that the best model is at step 3, where the validation ACL (average check loss) is at its minimum. The selected effects in the chosen model are Intercept, MomAge, MomWtGain, Black, Boy, Married, and MomSmoke.

Table 1

Next, you see the Parameter Estimates table.

[Image: parameter estimates for the quantile level 0.5 (median) model]

Table 2

These parameter estimates can be used to predict median birth weight from the predictors and to study the effect of the covariates on median birth weight. But our aim is to fit models at different quantiles and determine whether quantile regression produces a distinct set of parameter estimates and predictions at each quantile level. Assume that low, medium, and high birth weights correspond to the 5th, 50th, and 95th percentiles of birth weight. To fit models at these three quantiles with the Quantile Regression node, I would have to connect three separate Quantile Regression nodes to the Data node. Instead, I want to fit all three models in one go and compare their parameter estimates side by side. How can I accomplish this with a single node in the pipeline? I can build such a custom model by writing SAS code that calls PROC QTRSELECT in a SAS Code node. Return to the pipeline and add a SAS Code node to the Data node.

Right-click the Data node and select Add child node -> Miscellaneous -> SAS Code. Then rename the node: right-click the SAS Code node, select Rename, and enter Custom_QR. Next, right-click the Custom_QR node and select Move -> Supervised Learning. This moves the node to the Supervised Learning lane so that it is treated like the other modeling nodes. Click the Open Code Editor button in the node's properties pane and type the following code in the Training code window:

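/* Fit quantile regression models at three quantile levels in one call */
proc qtrselect data=&dm_data;
   /* Binary and nominal inputs are treated as classification effects */
   class %dm_binary_input %dm_nominal_input;
   /* QUANTILES= lists the quantile levels; STB adds standardized estimates */
   model %dm_dec_target = %dm_interval_input %dm_binary_input %dm_nominal_input / quantiles=(0.05 0.5 0.95) stb;
   /* Backward elimination; the final model is chosen on the validation data */
   selection method=backward(select=SBC stop=SBC slstay=0.01 choose=validate);
   /* The role variable assigns each row to training or validation */
   partition role=role (validate='valid' train='train');
run;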

Note the Model Studio macros and macro variables that identify the training data (&dm_data), the binary, nominal, and interval input variables, and the target variable. The CLASS statement names the classification variables to be used as explanatory variables in the analysis. The MODEL statement names the dependent variable and the explanatory effects, which can include covariates, main effects, interactions, and nested effects. In the MODEL statement, the QUANTILES= option specifies the quantile levels at which to fit the quantile regression model, and the STB option displays the standardized estimates. The SELECTION statement requests backward elimination, with SBC as the select and stop criterion and the validation data used to choose the final model. The PARTITION statement specifies how observations in the input data set are logically partitioned into disjoint subsets for model training, validation, and testing. The ROLE= option names the variable in the input data set whose values are used to assign a role to each observation.
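Outside Model Studio the %dm_ macros are not defined, so the same statements need explicit variable names. The following is a rough sketch under several assumptions that are not from the original article: an active CAS session, a CAS libref named mycas (hypothetical), the variable names as they appear in SASHELP.BWEIGHT, and a random 30% validation partition in place of the pipeline's role variable.

```sas
/* Assumption: a CAS session is active and mycas is a CAS libref (hypothetical name). */
data mycas.bweight;
   set sashelp.bweight;   /* copy the sample data into CAS */
run;

proc qtrselect data=mycas.bweight;
   /* Categorical inputs; treating Visit and MomEdLevel as nominal is an assumption */
   class Black Boy Married MomSmoke Visit MomEdLevel;
   /* Same options as the pipeline code: three quantile levels, standardized estimates */
   model Weight = MomAge MomWtGain CigsPerDay
                  Black Boy Married MomSmoke Visit MomEdLevel
                  / quantiles=(0.05 0.5 0.95) stb;
   selection method=backward(select=SBC stop=SBC choose=validate);
   /* Assumption: hold out a random 30% of rows for validation */
   partition fraction(validate=0.3 seed=12345);
run;
```

Because the partition here is random rather than the one created by Model Studio, the selected effects and estimates will not match the pipeline results exactly.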

Next, I click Run Pipeline to run the entire pipeline. Then, I right click on Custom_QR node and select Results. Expand the Output window and examine the results.

The Selection Information table shows that the selection method is backward selection, that SBC is the select and stop criterion, and that the best model is chosen by the validation average check loss (the check loss is the loss function in quantile regression). The Stop Horizon value is the number of consecutive steps at which the STOP= criterion must worsen for a local extremum to be detected. The default is 3, and you can change it with the STOPHORIZON= option.

The Selection Summary table shows the SBC value at each step of the backward elimination method. The average check loss (ACL) is the sum of the check losses of the residuals divided by the number of observations; it plays the role in quantile regression that mean squared error plays in least squares. Note that the selected effects in the chosen model are Intercept, CigsPerDay, MomAge, MomWtGain, Black, Boy, Married, MomSmoke, MomEdLevel, and Visit.
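Written out, in notation that is assumed here rather than taken from the article ($\hat r_i$ is the residual of observation $i$ from the model fitted at quantile level $\tau$, and $n$ is the number of observations being scored):

```latex
\mathrm{ACL}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(\hat r_i),
\qquad
\rho_\tau(r) = r\,\bigl(\tau - \mathbf{1}\{r < 0\}\bigr).
```

The validation ACL applies the same formula to the observations held out for validation.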

Next, I examine the parameter estimates table for the selected model at quantile level = 0.05.

[Image: parameter estimates for the quantile level 0.05 model]

The standardized estimates suggest that the most important variables in the 0.05 quantile model include the mother’s weight gain (a one-unit increase corresponds to a 13.14 gram increase in the 5th percentile of infant birth weight), whether the mother was Black (at this quantile, babies born to White women are nearly 293 grams heavier than those born to Black women), and whether the mother was married (babies whose mothers were not married weigh nearly 113 grams less than those whose mothers were married).

The parameter estimates for the model at quantile level = 0.5 (median) were already shown in Table 2. As you might notice, the selected covariates and their parameter estimates differ between the two models.

Finally, I examine the model at quantile level = 0.95 (the 95th percentile).

[Image: selection details for the quantile level 0.95 model]

[Image: parameter estimates for the quantile level 0.95 model]

Among others, the most important variables in the 0.95 quantile model are the mother’s weight gain (a one-unit increase corresponds to a 6.89 gram increase in infant birth weight), whether the mother smokes (babies of nonsmoking mothers are 201.99 grams heavier than babies of smoking mothers), and whether the baby was a boy (boys weigh nearly 126 grams more than girls). Notice that the effects of the variables differ considerably across the 0.05, 0.5, and 0.95 quantile models. For example, the estimated effect of the mother’s weight gain at the 95th percentile (6.89 grams per unit) is roughly half of its effect at the 5th percentile (13.14 grams per unit).
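One way to see how an effect changes across quantile levels is a quantile process plot. The sketch below swaps in a different procedure, PROC QUANTREG from SAS/STAT, rather than the QTRSELECT code used in the pipeline, and it performs no effect selection. It assumes that SAS/STAT is available in your environment. The effect list is an assumption based on the effects selected earlier, and the procedure options follow the pattern of the birth weight example in the SAS/STAT documentation.

```sas
/* Sketch only: PROC QUANTREG (SAS/STAT), not the procedure used in the pipeline. */
ods graphics on;

proc quantreg data=sashelp.bweight
              ci=sparsity/iid
              algorithm=interior(tolerance=5.e-4);
   class Visit MomEdLevel;   /* multi-level categorical inputs */
   /* Fit a grid of quantile levels and plot each estimate against the level */
   model Weight = Black Married Boy MomSmoke CigsPerDay MomAge MomWtGain
                  Visit MomEdLevel
                  / quantile=0.05 to 0.95 by 0.05 plot=quantplot;
run;
```

A coefficient whose plotted curve is roughly flat behaves like an OLS slope, whereas a curve that rises or falls across the quantile levels indicates an effect that depends on where in the birth weight distribution you look.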

Fitting a quantile regression model is especially useful when you want to understand which features are important in predicting the target at particular quantiles. You might want to know what predicts the median (50th percentile), as opposed to an ordinary least squares model, which predicts the mean of the target variable. Similarly, you might want to model lower or higher quantiles of the target variable to understand which features matter at those quantiles.

Comparison of Linear Regression and Quantile Regression

Having discussed the basics of quantile regression, here is a quick list of the differences between linear regression and quantile regression:

Linear Regression	Quantile Regression
Predicts the conditional mean	Predicts conditional quantiles
Makes distributional assumption about the error term	Is distribution agnostic
Assumes homoscedasticity	Can accommodate heteroscedasticity
Is sensitive to outliers	Is robust to outliers in response direction
Works well when sample is small and is computationally inexpensive	Needs sufficient data and is computationally intensive

Summary

When the relationship between the covariates and the response differs across quantiles of the response distribution, it can be beneficial to fit a separate regression model at each quantile of interest rather than rely on a single model of the conditional mean. In these instances, the coefficients of the various quantile models will differ from one another.
