NicolasC
Fluorite | Level 6

Hi there

 

I may have what sounds like a stupid question, but in the SEMMA methodology, why does sampling come first?

In other words, if I first manipulate my large data set (imputing missing values, binning interval data, etc.) and only afterwards sample it to build my model, is that complete nonsense?

Thanks

Nicolas 

 

 

1 ACCEPTED SOLUTION

PadraicGNeville
SAS Employee

Well, suppose one sample is used to do the analysis and another sample is held out to evaluate the results. If missing values are imputed using all the data, then the evaluation data is not completely independent of the analysis data, at least when there are a lot of missing values. So the better practice, which is also simpler to explain and may avoid criticism of the results, is to impute and bin on each sample separately.

That said, it often doesn't matter.  It's an art.
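
To make the independence point concrete, here is a minimal Python sketch (not SAS code, and not taken from the original posts). It assumes simple mean imputation on one hypothetical numeric column; the names `split_sample` and `impute_mean` and the simulated data are illustrative only. It compares the fill value computed from all rows with the fill values computed after sampling, on each sample separately.

```python
import random
import statistics


def split_sample(rows, holdout_fraction=0.3, seed=0):
    """Sample first: shuffle a copy of the rows, then split off a holdout."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]


def impute_mean(values):
    """Fill None with the mean of the observed values in this list only."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values], fill


# Hypothetical data: 1000 values, roughly 30% of them missing (None).
rng = random.Random(42)
data = [None if rng.random() < 0.3 else rng.gauss(50, 10) for _ in range(1000)]

analysis, holdout = split_sample(data)

# Impute, then sample: one fill value computed from every row,
# including rows that later land in the holdout sample.
_, fill_all = impute_mean(data)

# Sample, then impute: each sample is filled from its own observed values,
# so nothing computed on the holdout rows reaches the analysis sample.
analysis_filled, fill_analysis = impute_mean(analysis)
holdout_filled, fill_holdout = impute_mean(holdout)

print("fill from all data:       ", fill_all)
print("fill from analysis sample:", fill_analysis)
print("fill from holdout sample: ", fill_holdout)
```

With a large sample the three fill values tend to be close, which fits the remark that the order often doesn't matter in practice. The separate-sample version is simply the one where the holdout rows cannot have influenced the analysis sample.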


3 REPLIES
PadraicGNeville
SAS Employee

Your approach is fine. I came up with "SEMMA" as an easily remembered guide for those who have little analytical experience.  People with analytical experience will do what they know best.

-Padraic

NicolasC
Fluorite | Level 6

Thanks for your answer, Padraic. The reason I asked is that, in my non-exhaustive search, I never came across work where the sampling was not performed directly on the raw, imported, full data set.

Nicolas


