
Deep Dive on the mitigateBias Action: Hyperparameter Tuning

Started ‎11-20-2023
Modified ‎11-20-2023

This blog takes a deeper dive into the mitigateBias action and how to tune it. The SAS mitigateBias action uses a highly versatile approach (exponential gradient reduction) that:

 

  • Applies to any classifier family
  • Allows many definitions of fairness, including demographic parity and equalized odds

 

The approach is based on A Reductions Approach to Fair Classification by Agarwal et al. 2018 and results in a randomized classifier with the minimum empirical error subject to the fairness constraints selected by the user. For a more general introduction to assessing and mitigating bias using the assessBias and mitigateBias actions of the SAS Fair AI tool set, see my previous blog Mitigating Bias Using SAS Fair AI Tools.

 

Exponential Gradient Reduction Algorithm

 

The brains of the approach is the Exponential Gradient Reduction (EGR) algorithm, which is model agnostic. That means you can use it whether your model is gradient boosting, random forest, logistic regression, a neural network, a decision tree, etc. EGR works by iteratively manipulating the weights for different subsections of the data. For a simple example, let’s use a binary target of BAD (1 means loan default, 0 means no default) and a binary sensitive variable of female/male for gender. Thus we have four segments:

 

be_1_image001.png

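As a plain-Python illustration (not the SAS implementation), the four segments are simply the cross of the sensitive variable with the binary target:

```python
from collections import Counter

# Toy records: (gender, bad), where BAD = 1 means loan default.
data = [("female", 1), ("female", 0), ("female", 0),
        ("male", 1), ("male", 1), ("male", 0)]

# Crossing the sensitive variable with the target yields four segments.
segments = Counter(data)

for (gender, bad), n in sorted(segments.items()):
    print(f"gender={gender}, BAD={bad}: {n} observation(s)")
```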

 

By default the EGR of the mitigateBias action will manipulate these weights using the training data only. When the model is trained, each of the segments will have a different performance. The EGR algorithm works to compensate for bias by increasing the weights of segments that underperform, and decreasing the weights of segments that overperform. This is an iterative process, with the learning rate and the bound working in tandem to achieve the best result.
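Conceptually, one reweighting step looks something like the sketch below. This is an illustrative sketch only, not the action’s actual implementation: the multiplicative-exponential update and the names `update_weights` and `seg_error` are my assumptions, and `learning_rate` and `bound` mirror the hyperparameters discussed in this post.

```python
import math

def update_weights(weights, seg_error, global_error,
                   learning_rate=0.01, bound=100.0):
    """One EGR-style reweighting step (illustrative only).

    Segments whose error is above the global error (underperformers)
    get their weight increased; segments below it get it decreased.
    The bound caps how large any weight can grow.
    """
    new_weights = {}
    for seg, w in weights.items():
        gap = seg_error[seg] - global_error      # > 0 means underperforming
        w = w * math.exp(learning_rate * gap)    # multiplicative update
        new_weights[seg] = min(w, bound)         # cap at the bound
    return new_weights

weights = {("female", 1): 1.0, ("female", 0): 1.0,
           ("male", 1): 1.0, ("male", 0): 1.0}
seg_error = {("female", 1): 0.40, ("female", 0): 0.20,
             ("male", 1): 0.25, ("male", 0): 0.15}

weights = update_weights(weights, seg_error, global_error=0.25)
```

A larger learning rate or bound makes each step move the weights further, which is exactly the volatility tradeoff discussed in the tuning sections below.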

 

When the weights get updated, the model compensates for the change in weights, and the performance for each segment changes. The effect of the weight depends on many things, including:

 

  • What model you are using (gradient boosting, random forest, etc.)
  • The values of the observations in your data set in each segment
  • The number of observations in your data set in each segment

 

Hyperparameter tuning the mitigateBias action in SAS Viya

 

If you don’t get satisfactory results with your data using the defaults, you can adjust the hyperparameters.

 

be_2_image003.png

 

LogLevel

 

The logLevel specifies how much log information to print.  The default is currently 0 but will change to 1 in the next release (2023.11).  The logLevel meanings in the updated version will be:

 

0:  Suppress most notes, allow warnings and errors

1:  Default, level 0 + a few procedural and informative notes

2:  Level 1 + some more informative notes and a few notes per iteration (if you want notes per iteration this is the level to select)

3 and above:  Allows notes to print from the training program, generally used for debugging

 

There’s no real difference between levels 3 – 7 at this point in mitigateBias. The additional levels are reserved for future use.

 

Best practice is to set it at 5 for initial debugging. Once you have debugged, set it back to 1 to avoid filling up your log. Be sure to clear the log after each run.

 

PROTIP from Xin Hunt: If you have “print” statements in the training program and want to see them, either set logLevel to >= 3 (so all notes will be printed), or print them as a warning or error (e.g., print (warn) “some message”;). Warnings and errors are always shown regardless of logLevel.

 

Side note:  For mitigateBiasDecisionTree (and all future mitigateBiasXXX actions), log levels 4 and 5 print additional things, so the effective highest log level is 5.  However, this doesn’t affect mitigateBias.

 

How many iterations?

 

The default value for maxIters is ten. You can set it to the maximum of 50 if you have the time available. A high number of iterations is useful if you are working with a new data set, working on debugging, or if you need to tune some of the other hyperparameters.

 

In the ideal situation you will see that gradually over five to ten iterations the EGR will put the segments on an equal footing. But CAUTION! If it converges in just one iteration that may be a spurious result. This result may not be replicable with a new data set and thus will not be generalizable to the full population. On the other hand, if it takes too many iterations to converge then you may be wasting time, and you may want to increase the bound (see below).

 

Tolerance

 

Recall that there is commonly a tradeoff between accuracy and bias. Tolerance is based on how much bias the user is willing to tolerate and sets an acceptable stopping level for the EGR. The acceptable tolerance will depend on your domain (e.g., regulated industry) and on your data. If you have highly unbalanced data, for example, you may be willing to accept a higher tolerance in order to get results.

 

Tolerance is currently calculated as compared to the global average for that particular metric. For example, with predictive parity the tolerance will be calculated as the absolute value of the segment prediction minus the global average prediction. For accuracy it would be the absolute value of the segment accuracy minus the global average accuracy.
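That stopping check is easy to write out directly. The sketch below is illustrative Python, not the action’s internals; `within_tolerance` is a name I made up:

```python
def within_tolerance(segment_metric, global_metric, tolerance):
    """True if every segment's metric is within `tolerance` of the global value."""
    return all(abs(m - global_metric) <= tolerance
               for m in segment_metric.values())

# Example: per-segment average prediction vs. the global average.
segment_pred = {"female": 0.31, "male": 0.27}
global_pred = 0.29

print(within_tolerance(segment_pred, global_pred, tolerance=0.05))  # True
print(within_tolerance(segment_pred, global_pred, tolerance=0.01))  # False
```

The same comparison applies to accuracy: replace the per-segment predictions with per-segment accuracies and compare against the global accuracy.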

 

Bound, learningRate, and tuneBound

 

EGR is an iterative process with the learning rate (default 0.01) and the bound (default 100) working in tandem to achieve the best result over multiple iterations. There are no hard and fast rules because the effect of adjusting the bound and learning rate will depend on the model you are using (e.g., gradient boosting, logistic regression, etc.) and your data.

 

One best practice for tuning these hyperparameters is to initially leave the learning rate at the default, and adjust only the bound. The larger the bound, the more the weight changes on each iteration. Conversely, the smaller the bound, the less the weight changes on each iteration.

 

Segments with small numbers of observations will be highly affected by weight changes. Thus, if you have a high bound value, the performance/accuracy of underrepresented segments will be highly volatile (jump around a lot), leading to overshooting by the EGR. This is less of an issue with balanced data. However, if the bound is very high, you will see this volatility even with balanced data. If the bound is too low, you may see no change with each iteration, and you may never converge.

 

NOTE: In earlier releases of the software (LTS 2022.03 and earlier), there was an upper bound of 5,000 that could not be exceeded. Particularly in this older software, if you hit the maximum bound and still have not received satisfactory results, it may be helpful to adjust the learning rate instead. A higher learning rate may overshoot and a lower learning rate may undershoot.

 

If you are having trouble finding a good bound, you can set TuneBound=TRUE to get a good ballpark starting place for the bound. From there you can manually adjust to get even better results if you have time to tinker with the tuning.

 

TuneBound is a relatively simple meta-algorithm wrapped around EGR. It sets the bound very high for five iterations and very low for five iterations, then analyzes what the weights did in each of those runs. Weight changes are expected to be volatile/jumpy at the high bound; at the low bound, we would expect nonconvergence. If both situations hold (volatile at the high bound, nonconvergent at the low bound), tuneBound moves the bounds inward (the high bound not so high, the low bound not so low). Eventually we reach the point where the weights are stable at the high bound and we have convergence at the low bound. At that point, tuneBound takes the average of the two and sets the bound to that average.
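A rough sketch of that bisection-like idea is below. This is purely hypothetical (the actual, patented implementation is certainly more sophisticated): the predicates `is_volatile` and `converges`, the shrink factor, and the thresholds 500 and 10 in the example are all invented for illustration.

```python
def tune_bound(is_volatile, converges, high=5000.0, low=0.1,
               shrink=0.5, max_rounds=20):
    """Bisection-like search for a workable bound (illustrative only).

    While the high bound is still volatile and the low bound still
    fails to converge, both are moved inward; the result is the
    average of the two once both behave.
    """
    for _ in range(max_rounds):
        if not is_volatile(high) and converges(low):
            break                     # stable high bound, convergent low bound
        if is_volatile(high):
            high *= shrink            # high bound not so high
        if not converges(low):
            low /= shrink             # low bound not so low
    return (high + low) / 2.0

# Hypothetical behavior: weights go stable below a bound of 500,
# and the run converges once the bound exceeds 10.
bound = tune_bound(lambda b: b > 500, lambda b: b > 10)
```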

 

Tuning will not always work

 

Remember that old piano that sat in a damp place for years and now the wood is warped so it can’t be tuned? Just like that piano, it’s possible that your model may not converge no matter how much tuning you do.

be_3_image005.jpg

Recognize that if you have too few observations in a segment, it just won’t work. Not even with 50 iterations, not even if you play with both the bound and the learningRate. Take the extreme example of a segment of only one observation. Then the TPR will be either 0 or 1; it can’t be anything in between. If you compare this to another segment with 1,000 observations, that segment will have a TPR somewhere between 0 and 1, and your tiny segment will never match it. The larger the data set and the more balanced the data (equal proportions in the different segments), the better.
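The one-observation example is easy to verify with a quick sketch (plain Python, illustrative only):

```python
def tpr(y_true, y_pred):
    """True positive rate: fraction of actual positives predicted positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)

# A segment with a single positive observation: TPR can only be 0 or 1.
print(tpr([1], [1]))  # 1.0
print(tpr([1], [0]))  # 0.0

# A larger segment can land anywhere in between.
print(tpr([1, 1, 1, 1], [1, 1, 1, 0]))  # 0.75
```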

 

Also, if you have extreme bias in the data, you might not be able to overcome it; for example, if one segment is 100% 1s or 100% 0s in the target variable. In this case, you need to go back to the drawing board: look at your training data, evaluate the goals of the modeling effort, and so on.

 

Interpreting Results

 

Demographic (statistical) parity is achieved when the probability of a positive prediction is similar for those in different sensitive groups in the training data. With a binary target, demographic parity is satisfied for a classifier h if the prediction h(X) is statistically independent of the sensitive variable A under a distribution (X, A, Y). In other words, for a binary classifier where ŷ ∈ {0,1} if

 

E[h(X)|A=a] = E[h(X)] for all a.

 

In our simple example with gender (female/male) as the binary sensitive variable and the binary target BAD (1/0), this means that the expected value of h(X) given gender = female should equal the expected value of h(X) given gender = male.
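That check can be written out directly. The sketch below is illustrative Python with made-up predictions, not output from the action:

```python
def positive_rate(preds):
    """Empirical E[h(X)]: fraction of positive predictions."""
    return sum(preds) / len(preds)

# Hypothetical predictions h(X) for each gender group.
h_female = [1, 0, 1, 0]   # E[h(X) | A=female] = 0.5
h_male   = [1, 1, 0, 0]   # E[h(X) | A=male]   = 0.5
overall  = h_female + h_male

# Demographic parity: each group's rate matches the overall rate.
gap = max(abs(positive_rate(g) - positive_rate(overall))
          for g in (h_female, h_male))
print(gap)  # 0.0
```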

 

Here we see an example where, by Iteration 3, demographic parity was low (0.001929) without much sacrifice in misclassification rate (it went from 0.3330 to 0.3409).

 

be_4_image007.png

 

 

Equalized odds. There are known flaws with demographic parity, so some practitioners prefer to use equalized odds. Equalized odds matches both the true positive rates AND the false positive rates for different sensitive groups. A classifier h satisfies equalized odds if h(X) is conditionally independent of the sensitive variable A given Y, i.e.,

 

E[h(X)|A=a, Y=y] = E[h(X)|Y=y] for all a, y.
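Concretely, equalized odds means both the TPR gap and the FPR gap between groups should shrink. An illustrative sketch with made-up labels and predictions (not output from the action):

```python
def rate(y_true, y_pred, label):
    """TPR when label=1 (among actual positives); FPR when label=0 (among actual negatives)."""
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(preds) / len(preds)

# Hypothetical (labels, predictions) per group.
groups = {
    "female": ([1, 1, 0, 0], [1, 0, 0, 0]),   # TPR 0.5, FPR 0.0
    "male":   ([1, 1, 0, 0], [1, 1, 1, 0]),   # TPR 1.0, FPR 0.5
}

for name, (y, p) in groups.items():
    print(name, "TPR:", rate(y, p, 1), "FPR:", rate(y, p, 0))

# Equalized odds requires BOTH gaps to be small simultaneously.
tpr_gap = abs(rate(*groups["female"], 1) - rate(*groups["male"], 1))
fpr_gap = abs(rate(*groups["female"], 0) - rate(*groups["male"], 0))
```

Reducing one gap can worsen the other, which is why equalized odds is harder to tune than the single-metric definitions.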

 

Below, we see that we converge in 11 iterations at a tolerance of 0.18. We see a nice pattern of the equalized odds metric decreasing while the misclassification rate does not get worse.

 

be_5_image009.png

Thank you to my colleague Randy Collica for the data and results shown here.

 

 

Equal opportunity

 

Equal opportunity considers only the true positive rate (TPR); it does not take the false positive rate into consideration.

 

PRO TIP from my colleague Xinmin Wu: Demographic parity and equal opportunity will be easier to tune than equalized odds. This is because equalized odds deals with two metrics (TPR and FPR) simultaneously, and fine-tuning them can be challenging because improving one may worsen the other. Thus the current bound-tuning logic works best for demographic parity or equal opportunity. FYI, this bound tuning feature is not available in the open-source package Fairlearn; SAS has three US patents issued pertaining to it.

 

 

FOR MORE INFORMATION

 

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. (2018). “A Reductions Approach to Fair Classification.” Proceedings of the 35th International Conference on Machine Learning.

 

SAS Documentation

 

ACKNOWLEDGEMENTS

 

Thank you to the following SAS employees who provided details, data, and questions to help provide a deeper dive into the mitigateBias action.

  • Ricky Tharrington
  • Xin Hunt
  • Allie DeLonay
  • Xinmin Wu
  • Randy Collica

 


