<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Splitting string into separate sentences? in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Splitting-string-into-separate-sentences/m-p/830407#M328112</link>
    <description>&lt;P&gt;I'm working with qualitative data, where a variable COMMENT represents a string of text written in by a study participant. Some comments aren't in full sentences, but some have several sentences.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've figured out how to split the string at every period, writing each piece to a separate row &amp;amp; retaining an ID value (so I can manually match the comment "pieces" back to the original full comment if needed):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;/* inside a DATA step: output one row per period-delimited piece of COMMENT */
do i=1 by 1 while(scan(COMMENT,i,'.')^=' ');
	new=scan(COMMENT,i,'.');
	retain ID;
	output;
end;&lt;/PRE&gt;
&lt;P&gt;While looking through my first run of this code, I realized there is a problem: some comments include periods that don't end a sentence, such as the abbreviation "Mr." or a decimal like "12.1".&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What's the easiest way to add a layer to this code that will split the string only at the end of an actual sentence (delimited by a period)?&lt;/P&gt;</description>
    <pubDate>Thu, 25 Aug 2022 18:55:23 GMT</pubDate>
    <dc:creator>SAS93</dc:creator>
    <dc:date>2022-08-25T18:55:23Z</dc:date>
    <item>
      <title>Splitting string into separate sentences?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-string-into-separate-sentences/m-p/830407#M328112</link>
      <description>&lt;P&gt;I'm working with qualitative data, where a variable COMMENT represents a string of text written in by a study participant. Some comments aren't in full sentences, but some have several sentences.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've figured out how to split the string at every period, writing each piece to a separate row &amp;amp; retaining an ID value (so I can manually match the comment "pieces" back to the original full comment if needed):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;/* inside a DATA step: output one row per period-delimited piece of COMMENT */
do i=1 by 1 while(scan(COMMENT,i,'.')^=' ');
	new=scan(COMMENT,i,'.');
	retain ID;
	output;
end;&lt;/PRE&gt;
&lt;P&gt;While looking through my first run of this code, I realized there is a problem: some comments include periods that don't end a sentence, such as the abbreviation "Mr." or a decimal like "12.1".&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What's the easiest way to add a layer to this code that will split the string only at the end of an actual sentence (delimited by a period)?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Aug 2022 18:55:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-string-into-separate-sentences/m-p/830407#M328112</guid>
      <dc:creator>SAS93</dc:creator>
      <dc:date>2022-08-25T18:55:23Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting string into separate sentences?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-string-into-separate-sentences/m-p/830662#M328222</link>
      <description>&lt;P&gt;To reliably tokenize a body of text into sentences, you need a tool that understands how to break your data into its discrete concepts. SAS Text Analytics has this ability -- &lt;A href="https://blogs.sas.com/content/sgf/2018/07/26/how-to-tokenize-documents-into-sentences/" target="_self"&gt;see this article for details&lt;/A&gt;. These tools can also help with concept categorization and sentiment.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
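&lt;P&gt;Without such a tool, one way to adapt the DO loop from the question is to mark the sentence boundaries first and then scan on the marker. The following is only an untested sketch under several assumptions: the data set names COMMENTS and SENTENCES, the variable lengths, and the short abbreviation list are all made up for illustration, and the pipe character is assumed never to appear in the comments. A period counts as a sentence end only when whitespace follows it (so "12.1" is left alone) and when it is not preceded by one of the listed abbreviations.&lt;/P&gt;
&lt;PRE&gt;data sentences;                 /* hypothetical output data set */
	set comments;               /* hypothetical input with ID and COMMENT */
	length marked $ 2000 new $ 500;   /* assumed lengths; adjust to your data */

	/* Put a pipe after each period that is followed by whitespace,
	   unless the period follows Mr, Mrs, Ms or Dr.
	   Each abbreviation gets its own fixed-width lookbehind. */
	marked = prxchange('s/(?&amp;lt;!\bMr)(?&amp;lt;!\bMrs)(?&amp;lt;!\bMs)(?&amp;lt;!\bDr)\.\s+/.|/',
	                   -1, strip(COMMENT));

	/* Same loop as in the question, but splitting on the pipe marker */
	do i = 1 by 1 while (scan(marked, i, '|') ^= ' ');
		new = strip(scan(marked, i, '|'));
		output;
	end;

	drop marked i;
run;&lt;/PRE&gt;
&lt;P&gt;Each new exception means another lookbehind in the pattern, and cases such as "e.g." or sentences that end in "?" or "!" are not handled by this sketch.&lt;/P&gt;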
&lt;P&gt;I'm assuming you don't have access to SAS Text Analytics, so you would have to code for the most common special cases yourself. Take the "Mr." and "12.1" cases you mentioned: your code would need to recognize these patterns and&amp;nbsp;&lt;STRONG&gt;not&lt;/STRONG&gt; break the sentence at those periods. The list of exceptions could get pretty large if your text is diverse.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Aug 2022 18:21:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-string-into-separate-sentences/m-p/830662#M328222</guid>
      <dc:creator>ChrisHemedinger</dc:creator>
      <dc:date>2022-08-26T18:21:28Z</dc:date>
    </item>
  </channel>
</rss>

