<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502447#M134127</link>
    <description>&lt;P&gt;I've prepared a modified version of TABLE_TWO&amp;nbsp;where "good day" doesn't match because of a non-breakable space.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To investigate the issue I would add variable TEXT to the SELECT statement:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;proc sql;
create table test as
select a.*&lt;FONT color="#FF0000"&gt;, text&lt;/FONT&gt; from table_one as a, table_two as b
where find(a.Sentence, b.Text)&amp;gt;0;
quit;&lt;/PRE&gt;
&lt;P&gt;In the result (dataset TEST) I'd&amp;nbsp;notice that "good day" is missing in column TEXT.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then I would display the non-matching values&amp;nbsp;("A very good day" and what&amp;nbsp;should be&amp;nbsp;"good day") in hexadecimal format:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc print data=table_one(firstobs=3 obs=3);
format sentence $hex32.;
run;

proc print data=table_two(firstobs=2 obs=2);
format text $hex18.;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;(The format widths, 32 and 18, must be at least twice the lengths of the strings to be displayed, which are 15 and 8, resp., in this example.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A comparison of the PROC PRINT results would reveal the issue:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;Obs    sentence

 3     41207665727920&lt;STRONG&gt;&lt;FONT color="#008000"&gt;676F6F64&lt;/FONT&gt;&lt;FONT color="#FF0000"&gt;20&lt;/FONT&gt;&lt;FONT color="#008000"&gt;646179&lt;/FONT&gt;&lt;/STRONG&gt;20

Obs    text

 2     &lt;STRONG&gt;&lt;FONT color="#008000"&gt;676F6F64&lt;/FONT&gt;&lt;FONT color="#FF0000"&gt;A0&lt;/FONT&gt;&lt;FONT color="#008000"&gt;646179&lt;/FONT&gt;&lt;/STRONG&gt;20
&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To resolve the issue I'd create a "clean" version of TABLE_TWO, in this case with non-breakable spaces (&lt;FONT face="courier new,courier"&gt;'A0'x&lt;/FONT&gt;) replaced by ordinary space characters (&lt;FONT face="courier new,courier"&gt;'20'x = ' '&lt;/FONT&gt;&lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data clean_table_two;
set table_two;
text=translate(text,' ','A0'x);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With CLEAN_TABLE_TWO in place of TABLE_TWO the result in dataset MATCH should be as expected, unless there are more issues.&lt;/P&gt;</description>
    <pubDate>Mon, 08 Oct 2018 16:36:25 GMT</pubDate>
    <dc:creator>FreelanceReinh</dc:creator>
    <dc:date>2018-10-08T16:36:25Z</dc:date>
    <item>
      <title>Matching text to sentences - Outputting same Matches Separately instead of Collectively</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502420#M134115</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have the following objective that I want to accomplish:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have two tables. One of the tables (Table_one) has a column called 'Sentence'.&lt;BR /&gt;It has the values as follows below:&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; SENTENCE&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I live in New York&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; A bad day&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; A very good day&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have another table (Table_two) with a column called 'Text' in the form:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; TEXT&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; New York&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; good day&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; very good day&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to match phrases in 'Text' to sentences in 'Sentences' to see if they are contained in any of the 'sentences' observations. I want to output those sentences that do contain the text.&lt;/P&gt;&lt;P&gt;I understand that this s not difficult in and of itself, but I have a unique case that I could not find much info online.&lt;/P&gt;&lt;P&gt;What I want is a table that results in:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; MATCH&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I live in New York&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; A very good day&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; A very good day&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've tried the following code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; proc sql;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; create table match as&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; select a.* from table_one as a, table_two as b&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; where find(a.Sentence, b.Text)&amp;gt;0&lt;BR /&gt;;&lt;BR /&gt;run;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;What I get is the below result:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; MATCH&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I live in New York&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; A very good day&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In other words, since the observations in Table_two: 'good day' &amp;amp; 'very good day' both are contained in the sentence of Table_one: 'A very good day', it is treated like a single observation and returned only once in the output.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I however would like both of the phrases to be treated like individual observations and be output twice like my desired output.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have tried both the FIND() and INDEX() functions.&lt;BR /&gt;But both give me the same results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there anyway to avoid the single observation output and get two separate observations even if there are phrases in the same sentence?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help would be greatly appreciated.&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Oct 2018 14:44:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502420#M134115</guid>
      <dc:creator>akvinay7</dc:creator>
      <dc:date>2018-10-08T14:44:28Z</dc:date>
    </item>
    <item>
      <title>Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502428#M134118</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/178252"&gt;@akvinay7&lt;/a&gt;&amp;nbsp;and welcome to the SAS Support Communities!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I can't replicate your result. The below code results in three observations including the duplicate "A very good day", as it should.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data table_one;
input sentence $50.;
cards;
I live in New York
A bad day
A very good day
;

data table_two;
input text $30.;
cards;
New York
good day
very good day
;

proc sql;
create table match as
select a.* from table_one as a, table_two as b
where find(a.Sentence, b.Text)&amp;gt;0;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;What do you get&amp;nbsp;if you copy and run the above code? Maybe there is an invisible character (e.g. a non-breakable space in place of an ordinary space) in your TABLE_TWO so that one of the two anticipated matches fails.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Oct 2018 15:17:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502428#M134118</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2018-10-08T15:17:19Z</dc:date>
    </item>
    <item>
      <title>Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502447#M134127</link>
      <description>&lt;P&gt;I've prepared a modified version of TABLE_TWO&amp;nbsp;where "good day" doesn't match because of a non-breakable space.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To investigate the issue I would add variable TEXT to the SELECT statement:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;proc sql;
create table test as
select a.*&lt;FONT color="#FF0000"&gt;, text&lt;/FONT&gt; from table_one as a, table_two as b
where find(a.Sentence, b.Text)&amp;gt;0;
quit;&lt;/PRE&gt;
&lt;P&gt;In the result (dataset TEST) I'd&amp;nbsp;notice that "good day" is missing in column TEXT.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then I would display the non-matching values&amp;nbsp;("A very good day" and what&amp;nbsp;should be&amp;nbsp;"good day") in hexadecimal format:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc print data=table_one(firstobs=3 obs=3);
format sentence $hex32.;
run;

proc print data=table_two(firstobs=2 obs=2);
format text $hex18.;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;(The format widths, 32 and 18, must be at least twice the lengths of the strings to be displayed, which are 15 and 8, resp., in this example.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A comparison of the PROC PRINT results would reveal the issue:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;Obs    sentence

 3     41207665727920&lt;STRONG&gt;&lt;FONT color="#008000"&gt;676F6F64&lt;/FONT&gt;&lt;FONT color="#FF0000"&gt;20&lt;/FONT&gt;&lt;FONT color="#008000"&gt;646179&lt;/FONT&gt;&lt;/STRONG&gt;20

Obs    text

 2     &lt;STRONG&gt;&lt;FONT color="#008000"&gt;676F6F64&lt;/FONT&gt;&lt;FONT color="#FF0000"&gt;A0&lt;/FONT&gt;&lt;FONT color="#008000"&gt;646179&lt;/FONT&gt;&lt;/STRONG&gt;20
&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To resolve the issue I'd create a "clean" version of TABLE_TWO, in this case with non-breakable spaces (&lt;FONT face="courier new,courier"&gt;'A0'x&lt;/FONT&gt;) replaced by ordinary space characters (&lt;FONT face="courier new,courier"&gt;'20'x = ' '&lt;/FONT&gt;&lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data clean_table_two;
set table_two;
text=translate(text,' ','A0'x);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With CLEAN_TABLE_TWO in place of TABLE_TWO the result in dataset MATCH should be as expected, unless there are more issues.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Oct 2018 16:36:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502447#M134127</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2018-10-08T16:36:25Z</dc:date>
    </item>
    <item>
      <title>Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502503#M134149</link>
      <description>&lt;P&gt;Thank you so much&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32733"&gt;@FreelanceReinh&lt;/a&gt; for your thorough investigation and welcome!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I had tried your code out, and I was getting the duplicate observation like you mentioned, which had me baffled.&lt;/P&gt;&lt;P&gt;This made me believe the culprit was something to with incorrect variable lengths.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was reading my input from datasets, which I believe may have had those non-breakable spaces like you mentioned.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I do not know what kind of encoding they have used, but the data was 'messy' to say the least. I identified the 'messiness' using&lt;/P&gt;&lt;P&gt;the hexadecimal conversion idea like you described, and as you expected, there were many non '20' entries within the phrases to be matched.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have created a script using your code to convert those to wrong values to '20' - or essentially 'clean' the dataset.&amp;nbsp;&lt;/P&gt;&lt;P&gt;It took some time to clean the dataset, as there were a lot of different types of wrong codes in between.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I also found out that there were unnecessary trailing and ending spaces. So, I used the TRIM() function. The proc sql code was re-written below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;proc sql; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;create table test as select a.* &lt;/SPAN&gt;&lt;SPAN&gt;from table_one as a, table_two as b &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;where find(a.Sentence, TRIM(b.Text))&amp;gt;0; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;quit;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This gave me all the right values with duplicated observations when applicable.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I am "accepting" this solution as the right solution as this helped me identify, de-bug and fix the errors.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you so much once again for all of your help!&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;V&lt;SPAN&gt;ery happy to be apart of this high quality community of SAS users! &amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Oct 2018 19:38:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-text-to-sentences-Outputting-same-Matches-Separately/m-p/502503#M134149</guid>
      <dc:creator>akvinay7</dc:creator>
      <dc:date>2018-10-08T19:38:10Z</dc:date>
    </item>
  </channel>
</rss>

