Why do machine learning models in SAS Viya for Learners sometimes yield different answers?

7 Likes

This is a common question that we get from academics – who were trained on modeling procedures such as Ordinary Least Squares (OLS) that yield the same answer every time the code is submitted. A reframe of the question is: I submit a machine learning model (like a gradient boosting model) and get one answer. I then submit it a second time – and get a slightly different answer. What gives?

The simple answer: machine learning models are special. Very special.

But that’s not a sufficient response. So, let’s start with a slide from a SAS training course and then break it down a bit:

With machine learning and predictive modeling, we’re typically moving away from inferential statistics, which often provide deterministic results (i.e., b2 = 0.22… and we should expect that every time we run the model). When departing from the traditional tools of inferential statistics, we need to adjust our mindset. Machine learning models often produce nondeterministic results – i.e., which means that results should change a bit each time we press the submit button. With non-deterministic results, we essentially think about “converging” or “estimating” a model, rather than computing an exact value. Yes, machine learning and predictive models with their nondeterministic results are very much in the Bayesian spirit of things.

Being a bit less abstract, there are a few reasons why we’ll get marginally different results each time we run a machine learning algorithm:

The underlying training, validation, and testing samples change each time
- This is, of course, unless we fix the samples with a seed
The algorithm is purposely adding randomness into the equation
- A great example of this are gradient boosting models. The randomness is how it improves… and is how the machine actually learns. (See what I did there?)
When data are distributed across several CPUs, this can also lead to slightly different results
- Why? Because while the mean of the mean is the global mean, the mode of the mode is not always the global mode. Ponder that for a bit.

So – to both academics and industry hackers alike – I say to you: don’t fret if your models are changing slightly. The goal of machine learning and predictive models isn’t to have precise estimates on the right-hand side (i.e., explanatory/independent) variables. Instead, the objective is to best predict future events. And several models could do that equally well.

Why do machine learning models in SAS Viya for Learners sometimes yield different answers?

Free course: Data Literacy Essentials

Get Started