<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Text Length, Population Stability Index and Language detection in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Length-Population-Stability-Index-and-Language-detection/m-p/832893#M10303</link>
    <description>&lt;P&gt;I'm training a language detection model using:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;a training set, labelled as English or not English, made up of sentences or short paragraphs of varying length&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;a score set derived from web scraping, whose texts are usually much longer than those in the training set (a blog entry, for example)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The language identification model works with bigrams (all sequences of two characters extracted from the text in each set). Since the bigram frequencies are continuous variables, they are re-classified into bins using deciles.&lt;/P&gt;&lt;P&gt;The upper and lower limits defining each bin, derived from the training set, are then used to re-classify the score set's bigrams into bins as well.&lt;/P&gt;&lt;P&gt;Given this reclassification process, if I extracted each sentence in the training set independently, I would end up with much shorter text lengths than the average score set entry, meaning that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;the bins derived from the training set would have one distribution in the training set (skewed to the left, as the shorter texts produce lower bigram frequencies)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;and a different distribution in the score set (skewed to the right, as the longer texts push the bigrams into the higher bins)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This would produce an unstable Population Stability Index, and the trained model would be affected by it.&lt;/P&gt;&lt;P&gt;So, in a scenario where the available training text dataset differs in sentence length from the scoring text dataset, how should I proceed in order to obtain a stable PSI?&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Combine the sentences/paragraphs in the training set so that their average text length is similar to the average text length in the score set?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is there any other suggestion or material explaining how best to approach this subject?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Mon, 12 Sep 2022 13:29:10 GMT</pubDate>
    <dc:creator>dcortell</dc:creator>
    <dc:date>2022-09-12T13:29:10Z</dc:date>
    <item>
      <title>Text Length, Population Stability Index and Language detection</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Text-Length-Population-Stability-Index-and-Language-detection/m-p/832893#M10303</link>
      <description>&lt;P&gt;I'm training a language detection model using:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;a training set, labelled as English or not English, made up of sentences or short paragraphs of varying length&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;a score set derived from web scraping, whose texts are usually much longer than those in the training set (a blog entry, for example)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The language identification model works with bigrams (all sequences of two characters extracted from the text in each set). Since the bigram frequencies are continuous variables, they are re-classified into bins using deciles.&lt;/P&gt;&lt;P&gt;The upper and lower limits defining each bin, derived from the training set, are then used to re-classify the score set's bigrams into bins as well.&lt;/P&gt;&lt;P&gt;Given this reclassification process, if I extracted each sentence in the training set independently, I would end up with much shorter text lengths than the average score set entry, meaning that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;the bins derived from the training set would have one distribution in the training set (skewed to the left, as the shorter texts produce lower bigram frequencies)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;and a different distribution in the score set (skewed to the right, as the longer texts push the bigrams into the higher bins)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This would produce an unstable Population Stability Index, and the trained model would be affected by it.&lt;/P&gt;&lt;P&gt;So, in a scenario where the available training text dataset differs in sentence length from the scoring text dataset, how should I proceed in order to obtain a stable PSI?&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Combine the sentences/paragraphs in the training set so that their average text length is similar to the average text length in the score set?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is there any other suggestion or material explaining how best to approach this subject?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Mon, 12 Sep 2022 13:29:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Text-Length-Population-Stability-Index-and-Language-detection/m-p/832893#M10303</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-09-12T13:29:10Z</dc:date>
    </item>
  </channel>
</rss>
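
For reference, the binning-and-PSI step described in the post can be sketched as follows. This is a minimal Python illustration, assuming bigram frequencies as the continuous input; the function name, the ten-bin decile split, and the 1e-6 floor on bin proportions are assumptions made for the sketch, not details taken from the post.

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between training (expected) and
    scoring (actual) values, with bin edges fixed from the training data."""
    # Decile edges are computed on the training values only
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch scoring values outside the training range

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Floor the proportions so empty bins do not produce log(0) or division by zero
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
```

Applied to the bigram frequencies of short training sentences versus long scraped pages, this would show the inflated PSI the post describes; recomputing it after concatenating training sentences to a length comparable with the score set is one way to test option 1. A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift.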

