Solved: Matching text to sentences - Outputting same Matches Separately instea...

akvinay7 · Posted 10-08-2018 10:44 AM

Hello,

I have the following objective that I want to accomplish:

I have two tables. One of the tables (Table_one) has a column called 'Sentence'.
It has the values as follows below:

   SENTENCE
    I live in New York
    A bad day
    A very good day

I have another table (Table_two) with a column called 'Text' in the form:

    TEXT
    New York
    good day
    very good day

I want to match phrases in 'Text' to sentences in 'Sentences' to see if they are contained in any of the 'sentences' observations. I want to output those sentences that do contain the text.

I understand that this s not difficult in and of itself, but I have a unique case that I could not find much info online.

What I want is a table that results in:

    MATCH
    I live in New York
    A very good day
    A very good day

I've tried the following code:

    proc sql;
    create table match as
    select a.* from table_one as a, table_two as b
    where find(a.Sentence, b.Text)>0
;
run;

What I get is the below result:

    MATCH
    I live in New York
    A very good day

In other words, since the observations in Table_two: 'good day' & 'very good day' both are contained in the sentence of Table_one: 'A very good day', it is treated like a single observation and returned only once in the output.

I however would like both of the phrases to be treated like individual observations and be output twice like my desired output.

I have tried both the FIND() and INDEX() functions.
But both give me the same results.

Is there anyway to avoid the single observation output and get two separate observations even if there are phrases in the same sentence?

Any help would be greatly appreciated.

FreelanceReinh · Posted 10-08-2018 12:36 PM

I've prepared a modified version of TABLE_TWO where "good day" doesn't match because of a non-breakable space.

To investigate the issue I would add variable TEXT to the SELECT statement:

proc sql;
create table test as
select a.*, text from table_one as a, table_two as b
where find(a.Sentence, b.Text)>0;
quit;

In the result (dataset TEST) I'd notice that "good day" is missing in column TEXT.

Then I would display the non-matching values ("A very good day" and what should be "good day") in hexadecimal format:

proc print data=table_one(firstobs=3 obs=3);
format sentence $hex32.;
run;

proc print data=table_two(firstobs=2 obs=2);
format text $hex18.;
run;

(The format widths, 32 and 18, must be at least twice the lengths of the strings to be displayed, which are 15 and 8, resp., in this example.)

A comparison of the PROC PRINT results would reveal the issue:

Obs    sentence

 3     41207665727920676F6F642064617920

Obs    text

 2     676F6F64A064617920

To resolve the issue I'd create a "clean" version of TABLE_TWO, in this case with non-breakable spaces ('A0'x) replaced by ordinary space characters ('20'x = ' '😞

data clean_table_two;
set table_two;
text=translate(text,' ','A0'x);
run;

With CLEAN_TABLE_TWO in place of TABLE_TWO the result in dataset MATCH should be as expected, unless there are more issues.

View solution in original post

FreelanceReinh · Posted 10-08-2018 11:17 AM

Hello @akvinay7 and welcome to the SAS Support Communities!

I can't replicate your result. The below code results in three observations including the duplicate "A very good day", as it should.

data table_one;
input sentence $50.;
cards;
I live in New York
A bad day
A very good day
;

data table_two;
input text $30.;
cards;
New York
good day
very good day
;

proc sql;
create table match as
select a.* from table_one as a, table_two as b
where find(a.Sentence, b.Text)>0;
quit;

What do you get if you copy and run the above code? Maybe there is an invisible character (e.g. a non-breakable space in place of an ordinary space) in your TABLE_TWO so that one of the two anticipated matches fails.

FreelanceReinh · Posted 10-08-2018 12:36 PM

I've prepared a modified version of TABLE_TWO where "good day" doesn't match because of a non-breakable space.

To investigate the issue I would add variable TEXT to the SELECT statement:

proc sql;
create table test as
select a.*, text from table_one as a, table_two as b
where find(a.Sentence, b.Text)>0;
quit;

In the result (dataset TEST) I'd notice that "good day" is missing in column TEXT.

Then I would display the non-matching values ("A very good day" and what should be "good day") in hexadecimal format:

proc print data=table_one(firstobs=3 obs=3);
format sentence $hex32.;
run;

proc print data=table_two(firstobs=2 obs=2);
format text $hex18.;
run;

(The format widths, 32 and 18, must be at least twice the lengths of the strings to be displayed, which are 15 and 8, resp., in this example.)

A comparison of the PROC PRINT results would reveal the issue:

Obs    sentence

 3     41207665727920676F6F642064617920

Obs    text

 2     676F6F64A064617920

To resolve the issue I'd create a "clean" version of TABLE_TWO, in this case with non-breakable spaces ('A0'x) replaced by ordinary space characters ('20'x = ' '😞

data clean_table_two;
set table_two;
text=translate(text,' ','A0'x);
run;

With CLEAN_TABLE_TWO in place of TABLE_TWO the result in dataset MATCH should be as expected, unless there are more issues.

akvinay7 · Posted 10-08-2018 03:38 PM

Thank you so much @FreelanceReinh for your thorough investigation and welcome!

I had tried your code out, and I was getting the duplicate observation like you mentioned, which had me baffled.

This made me believe the culprit was something to with incorrect variable lengths.

I was reading my input from datasets, which I believe may have had those non-breakable spaces like you mentioned.

I do not know what kind of encoding they have used, but the data was 'messy' to say the least. I identified the 'messiness' using

the hexadecimal conversion idea like you described, and as you expected, there were many non '20' entries within the phrases to be matched.

I have created a script using your code to convert those to wrong values to '20' - or essentially 'clean' the dataset.

It took some time to clean the dataset, as there were a lot of different types of wrong codes in between.

I also found out that there were unnecessary trailing and ending spaces. So, I used the TRIM() function. The proc sql code was re-written below:

proc sql;

create table test as select a.* from table_one as a, table_two as b

where find(a.Sentence, TRIM(b.Text))>0;

quit;

This gave me all the right values with duplicated observations when applicable.

I am "accepting" this solution as the right solution as this helped me identify, de-bug and fix the errors.

Thank you so much once again for all of your help!

Very happy to be apart of this high quality community of SAS users!

Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

Re: Matching text to sentences - Outputting same Matches Separately instead of Collectively

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!