I'm training a language detection model using:
a training set of sentences or short paragraphs, labelled as English or not English, where the length of the texts can vary
a score set derived from web scraping, whose entries are usually much longer than the training texts (e.g. an entire blog post)
The language identification model works with character bigrams (all sequences of two consecutive characters extracted from the text in each set). Their counts, being continuous variables, are then re-classified into bins using deciles.
The upper and lower limits defining each bin, derived from the training set, are then used to re-classify the score set's bigrams into the same bins.
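For reference, the binning works roughly like this (a minimal sketch with toy data; names and details are simplified compared to the real pipeline):

```python
from collections import Counter
import numpy as np

def bigram_counts(text):
    """Count every overlapping two-character sequence in a text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Toy stand-ins for the real training and score sets.
train_texts = ["the cat sat on the mat", "el gato se sento en la alfombra"]
score_texts = ["a much longer blog entry scraped from the web, running on and on"]

# Pool the bigram counts of the training set: this is the continuous variable.
train_counts = np.array(
    [c for t in train_texts for c in bigram_counts(t).values()], dtype=float
)

# Decile edges derived from the training set only...
edges = np.quantile(train_counts, np.linspace(0, 1, 11))

# ...and the same edges are reused to bin the score set's bigram counts.
score_counts = np.array(
    [c for t in score_texts for c in bigram_counts(t).values()], dtype=float
)
train_bins = np.digitize(train_counts, edges[1:-1])  # bin indices 0..9
score_bins = np.digitize(score_counts, edges[1:-1])
```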
Now, given the reclassification process above, if I extracted each training sentence independently, the text lengths would be much shorter than the average score set entry, meaning that:
the bins derived from the training set would have a different distribution in the training set (skewed to the left, since the texts are shorter and the bigram counts therefore lower)
than in the score set (skewed to the right, since with longer texts the bigrams would mostly end up re-classified into the higher bins).
This would produce an unstable Population Stability Index (PSI), and the trained model would be affected as a result.
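By PSI I mean the standard index computed over the ten bins, i.e. something along these lines (the bin indices are the ones from the sketch above; the epsilon only guards against log(0)):

```python
import numpy as np

def psi(expected_bins, actual_bins, n_bins=10, eps=1e-6):
    """Population Stability Index between two binned distributions,
    given as arrays of bin indices in 0..n_bins-1."""
    expected = np.bincount(expected_bins, minlength=n_bins) / len(expected_bins)
    actual = np.bincount(actual_bins, minlength=n_bins) / len(actual_bins)
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```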
So, in a scenario where the available training texts differ in length from the scoring texts, how should I proceed in order to get a stable PSI?
Should I combine the sentences/paragraphs in the training set so that the average text length is similar to the average text length in the score set (something like the sketch below)?
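By "combine" I mean something like the greedy concatenation below, done separately per language label so the labels stay valid (purely an illustration of the idea, not a tested procedure):

```python
def combine_to_target_length(texts, target_len):
    """Concatenate short training texts until each combined document
    is roughly as long as an average score-set entry."""
    combined, buffer = [], ""
    for t in texts:
        buffer = (buffer + " " + t).strip()
        if len(buffer) >= target_len:
            combined.append(buffer)
            buffer = ""
    if buffer:  # keep any leftover shorter text
        combined.append(buffer)
    return combined
```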
Is there any other suggestion, or material that explains how best to approach this?