SAS Data Science

dcortell

I'm training a language detection model using:

  • a training set of sentences or short paragraphs of varying length, each labeled as English or not English

  • a score set obtained by web scraping, whose entries are usually much longer than the training texts (a blog entry, for example)

The language identification model works with bigrams (every sequence of two consecutive characters extracted from each text). Because the bigram frequencies are continuous variables, they are re-classified into bins using deciles.

The upper and lower limits of each bin are derived from the training set and are then used to re-classify the score set's bigrams into bins as well.
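To make the binning step concrete, here is a minimal Python sketch. It assumes the continuous variable is the per-text count of a given bigram, and that the decile cut points are computed once on the training set and reused unchanged for the score set. The helper names (`bigram_counts`, `decile_edges`, `to_bin`) and the toy texts are hypothetical, not part of any SAS product.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def bigram_counts(text):
    """Count every overlapping pair of adjacent characters in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def decile_edges(values):
    """Return the 9 interior cut points that split the values into deciles."""
    return quantiles(values, n=10, method="inclusive")


def to_bin(value, edges):
    """Map a value to a bin index 0..9 using edges learned on the training set."""
    return bisect_right(edges, value)


# Hypothetical toy data: count of the bigram "th" in each short training text.
train_texts = ["the cat", "this is the thing", "hola mundo", "that thatch there"]
train_th = [bigram_counts(t)["th"] for t in train_texts]
edges = decile_edges(train_th)

# A longer score-set text is binned with the training edges, not its own.
score_text = "the other thing that they thought " * 5
score_bin = to_bin(bigram_counts(score_text)["th"], edges)
```

With these toy inputs the long score text lands in the top bin, simply because a longer text contains more occurrences of the bigram.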

Given the reclassification process above, if I extracted the bigrams from each training text independently, I would be working with much shorter texts than the average score set entry, meaning that:

  • in the training set, the bins would have one distribution (skewed to the left, since shorter texts yield lower bigram frequencies)

  • in the score set, the bins would have a different distribution (skewed to the right, since with longer texts the bigrams would nearly all be re-classified into the higher bins)

This would produce an unstable Population Stability Index (PSI), and the trained model would be affected as a result.
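For reference, a rough sketch of how the PSI could be computed from the bin assignments, using the usual definition: the sum over bins of (actual share − expected share) × ln(actual share / expected share). The function name `psi`, the `eps` floor for empty bins, and the toy bin lists are assumptions made for illustration.

```python
import math


def psi(expected_bins, actual_bins, n_bins=10, eps=1e-6):
    """Population Stability Index between two lists of bin indices."""
    total = 0.0
    for b in range(n_bins):
        # Share of observations falling in bin b, floored to avoid log(0).
        e = max(expected_bins.count(b) / len(expected_bins), eps)
        a = max(actual_bins.count(b) / len(actual_bins), eps)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical bin assignments: training spread evenly across deciles,
# score set piled into the two highest bins.
train_bins = list(range(10)) * 10
score_bins = [8] * 30 + [9] * 70

same = psi(train_bins, train_bins)     # identical distributions
shifted = psi(train_bins, score_bins)  # shifted distribution
```

Identical distributions give a PSI of zero, while the shifted score set gives a large positive value. The exact figure depends heavily on the `eps` floor chosen for empty bins.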

So, in a scenario where the texts in the available training dataset differ in length from those in the scoring dataset, how should I proceed in order to get a stable PSI?

  1. Combine the sentences/paragraphs in the training set so that their average text length is similar to the average text length in the score set?

  2. Are there any other suggestions or materials that explain how best to approach the subject?
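Option 1 could be sketched as follows in Python. This is only an illustration of the idea, assuming the texts are pooled within a single class label (English with English, non-English with non-English) and that `target_len` is set to roughly the average score-set length. The function name `pool_texts` and the sample sentences are hypothetical.

```python
def pool_texts(texts, target_len, sep=" "):
    """Greedily concatenate consecutive texts until each chunk reaches target_len characters."""
    chunks, current = [], ""
    for text in texts:
        current = f"{current}{sep}{text}" if current else text
        if len(current) >= target_len:
            chunks.append(current)
            current = ""
    if current:  # keep the shorter leftover rather than dropping data
        chunks.append(current)
    return chunks


# Hypothetical English training sentences and a hypothetical target length.
english = ["The cat sat.", "It rained all day.", "She reads a lot.", "We left early."]
pooled = pool_texts(english, target_len=30)
```

Here the four short sentences become two pooled texts of at least 30 characters each. The trade-off is fewer training rows in exchange for text lengths closer to those in the score set.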
