
Why do machine learning models in SAS Viya for Learners sometimes yield different answers?

Started ‎03-17-2023 by
Modified ‎07-26-2023 by

This is a common question from academics who were trained on modeling procedures such as Ordinary Least Squares (OLS), which yield the same answer every time the code is submitted. To reframe the question: I submit a machine learning model (like a gradient boosting model) and get one answer. I then submit it a second time and get a slightly different answer. What gives?

 

The simple answer: machine learning models are special.  Very special.

 

But that’s not a sufficient response.  So, let’s start with a slide from a SAS training course and then break it down a bit:

 

[Image: slide from a SAS training course]

 

With machine learning and predictive modeling, we’re typically moving away from inferential statistics, which often provide deterministic results (e.g., b2 = 0.22…, and we should expect that value every time we run the model). When departing from the traditional tools of inferential statistics, we need to adjust our mindset. Machine learning models often produce nondeterministic results, which means that results can change a bit each time we press the submit button. With nondeterministic results, we essentially think about “converging on” or “estimating” a model, rather than computing an exact value. Yes, machine learning and predictive models with their nondeterministic results are very much in the Bayesian spirit of things.
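To make the deterministic versus nondeterministic contrast concrete, here is a minimal sketch in Python with NumPy (not SAS code, and not what SAS Viya runs internally). It assumes simulated data and uses a hand-rolled stochastic gradient descent fit as a stand-in for a machine learning algorithm; the helper names `make_data`, `fit_ols`, and `fit_sgd` are made up for this illustration.

```python
import numpy as np

def make_data(n=500, seed=0):
    """Simulate a small regression data set (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta = np.array([1.0, 0.5, 0.22])
    y = X @ beta + rng.normal(scale=0.5, size=n)
    return X, y

def fit_ols(X, y):
    """Least squares solved directly: same data in, same coefficients out."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fit_sgd(X, y, seed=None, epochs=5, lr=0.01):
    """Stochastic gradient descent: rows are visited in a random order,
    so the estimate depends on the random number stream."""
    rng = np.random.default_rng(seed)  # seed=None draws fresh entropy each call
    coef = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ coef - y[i]) * X[i]
            coef -= lr * grad
    return coef

X, y = make_data()

# OLS: two submissions, one answer.
ols_a, ols_b = fit_ols(X, y), fit_ols(X, y)

# SGD with different random streams: close to OLS, but not identical to each other.
sgd_1, sgd_2 = fit_sgd(X, y, seed=1), fit_sgd(X, y, seed=2)

# SGD with the same seed: reproducible again.
sgd_c, sgd_d = fit_sgd(X, y, seed=42), fit_sgd(X, y, seed=42)

print("OLS   :", ols_a, ols_b)
print("SGD   :", sgd_1, sgd_2)
print("seeded:", sgd_c, sgd_d)
```

The two SGD runs land near the OLS coefficients without matching each other digit for digit, which is the “estimating rather than computing” mindset in miniature.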

 

To be a bit less abstract, there are a few reasons why we’ll get marginally different results each time we run a machine learning algorithm:

 

  • The underlying training, validation, and testing samples change each time
    • That is, of course, unless we fix the samples with a seed
  • The algorithm is purposely adding randomness into the equation
    • A great example of this is gradient boosting, which can train each tree on a random subsample of the rows (and sometimes the columns). That randomness helps the model generalize… and is how the machine actually learns. (See what I did there?)
  • When data are distributed across several CPUs, this can also lead to slightly different results
    • Why? Because while the (size-weighted) mean of the partition means is the global mean, the mode of the partition modes is not always the global mode. Ponder that for a bit.
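Two of the points in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SAS code: `split_indices` is a hypothetical helper that mimics a seeded train/validation/test partition, and the three small lists stand in for data spread across three workers.

```python
import math
from collections import Counter

import numpy as np

def split_indices(n, seed=None, train_frac=0.6, valid_frac=0.3):
    """Randomly partition row indices into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

# Same seed, same partition. Different seeds, different partitions.
train_a, _, _ = split_indices(100, seed=12345)
train_b, _, _ = split_indices(100, seed=12345)
train_c, _, _ = split_indices(100, seed=54321)
print("same seed, same rows:", np.array_equal(train_a, train_b))
print("new seed, same rows :", np.array_equal(train_a, train_c))

def mode(values):
    """Most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical data spread over three equally sized partitions (workers).
partitions = [
    [1, 1, 2, 2, 2],
    [1, 1, 3, 3, 3],
    [1, 1, 4, 4, 4],
]
all_values = [v for part in partitions for v in part]

# Means combine cleanly across equal-sized partitions...
global_mean = sum(all_values) / len(all_values)
mean_of_means = sum(sum(p) / len(p) for p in partitions) / len(partitions)

# ...but modes do not: 1 is the most common value overall,
# yet it is not the mode of any single partition.
global_mode = mode(all_values)
mode_of_modes = mode([mode(p) for p in partitions])

print("global mean:", global_mean, "| mean of means:", mean_of_means)
print("global mode:", global_mode, "| mode of modes:", mode_of_modes)
```

Any statistic that behaves like the mode here can shift when the same rows end up grouped differently across workers, even though nothing about the data itself has changed.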

 

So, to academics and industry hackers alike, I say: don’t fret if your models are changing slightly. The goal of machine learning and predictive models isn’t to have precise estimates for the right-hand-side (i.e., explanatory/independent) variables. Instead, the objective is to best predict future events. And several models could do that equally well.
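As a rough illustration of that last point, the sketch below fits the same simple model on two different random subsamples of simulated data and scores both on a common holdout set. It is Python with NumPy rather than SAS, the data are made up, and a plain least-squares fit stands in for whatever machine learning model you are running; `fit_on_subsample` and `holdout_mse` are hypothetical helper names.

```python
import numpy as np

# Simulated data (an assumption for this sketch, not from any SAS example).
rng = np.random.default_rng(2023)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.22]) + rng.normal(size=n)

# Hold out the last 1,000 rows for scoring.
X_train, y_train = X[:2000], y[:2000]
X_hold, y_hold = X[2000:], y[2000:]

def fit_on_subsample(seed, size=1000):
    """Fit least squares on a random subsample of the training rows."""
    rows = np.random.default_rng(seed).choice(len(y_train), size=size, replace=False)
    coef, *_ = np.linalg.lstsq(X_train[rows], y_train[rows], rcond=None)
    return coef

def holdout_mse(coef):
    """Mean squared prediction error on the holdout rows."""
    return float(np.mean((X_hold @ coef - y_hold) ** 2))

coef_1, coef_2 = fit_on_subsample(seed=1), fit_on_subsample(seed=2)
mse_1, mse_2 = holdout_mse(coef_1), holdout_mse(coef_2)

print("coefficients, run 1:", coef_1)
print("coefficients, run 2:", coef_2)
print("holdout MSE, run 1 :", mse_1)
print("holdout MSE, run 2 :", mse_2)
```

The coefficients differ a little between the two runs, while the holdout error is nearly the same. If prediction is the goal, either fit would do.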
