Loop over multiple datasets to perform text matching

larsc · Posted 03-28-2021 05:46 PM

Hi SAS community,

I am performing text-matching using regular expression based on the values of two tables (simplified for demonstration purposes below). My issue is that I am very unsure about the best way to go about setting up my code for doing this check for matches between values of one table with values in another table.

My data looks like the following:

Table1 (Have1):

Keyword Category

Hello | Hey Greeting

Canada | America North America

Football | Basketball Sports

Date Comment

01JAN2021 Hello, my name is larsc

01JAN2021 I live in Europe

01JAN2021 I like to watch football

My desired output would be like:

Category Comment

Greeting Hello, my name is larsc

I live in Europe

Sports I like to watch football

Code3 Test3

In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.

I have done very (very) little looping thus far in SAS as may be evident from my pseudocode below, and I am unsure if setting this up as a loop is the best approach for my problem.

In pseudocode, my current approach would be something like the following:

data want;
   set have1 have2;
   do over have2; *Loop through comments in second table
      do over have1; *Loop over first table to check for match between comment and keyword, if match assign value of category to result
          if prxmatch(keyword, comment) then result = category;
      end;
   end;
run;

Hopefully this demonstrates my issue sufficiently, but please let me know if you want me to expand on my problem. The coding is done within a User Written Code in Data Integration studio.

Patrick · Posted 03-28-2021 06:12 PM

Should your dataset Want have as many rows as there are in Have2? In other words: Can there be more than one match? And what should happen if there is no match?

larsc · Posted 03-28-2021 06:28 PM

Hi Patrick,

My Want dataset should have as many records as dataset Have2. If no match can be found, the record from dataset Have2 should be output with a missing Category value (i.e., no match found). Let me add a desired output table to illustrate:

Table1 (Have1):

Keyword Category

Test1 Category1

Test2 Category2

Test3 Category3

Table2 (Have2):

Code Comment

Code1 Test1

Code2 Test2

Code3 Test3

Code4 Test4

Code5 Test5

Code6 Test1

Code7 Test1

Code8 Test2

Desired output (Want):

Category Comment

Category1 Test1

Category2 Test2

Category3 Test3

Test4

Test5

Category1 Test1

Category2 Test2

Patrick · Posted 03-28-2021 06:39 PM

One way to go would be to create an informat with the RegEx like documented here or with more examples here.

You then can add an OTHER = " " for any case with no match.

mkeintz · Posted 03-29-2021 12:11 AM

Your pseudo-data suggests you are doing exact matches between COMMENT in HAVE2 and KEYWORD in HAVE1. So why are you using prxmatch instead of a straightforward join in proc sql?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

larsc · Posted 03-29-2021 02:26 AM

Hey mkeintz,

Sorry, I should have been more precise in my description. I am not actually doing exact matching between my comments and keywords, my example is oversimplified. In reality, the data looks more like this:

Table1 (Have1):

Keyword Category

Hello | Hey Greeting

Canada | America North America

Football | Basketball Sports

Date Comment

01JAN2021 Hello, my name is larsc

01JAN2021 I live in Europe

01JAN2021 I like to watch football

Category Comment

Greeting Hello, my name is larsc

I live in Europe

Sports I like to watch football

In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.

andreas_lds · Posted 03-29-2021 03:01 AM

@larsc wrote:

Hey mkeintz,

Sorry, I should have been more precise in my description. I am not actually doing exact matching between my comments and keywords, my example is oversimplified. In reality, the data looks more like this:

Table1 (Have1):

Keyword Category

Hello | Hey Greeting

Canada | America North America

Football | Basketball Sports

Date Comment

01JAN2021 Hello, my name is larsc

01JAN2021 I live in Europe

01JAN2021 I like to watch football

Category Comment

Greeting Hello, my name is larsc

I live in Europe

Sports I like to watch football

In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.

Yes, you should have posted this example in the initial post as working data step.

So before i start, are all words in "Keyword" unique? Meaning each word is only once in the dataset.

larsc · Posted 03-29-2021 03:28 AM

Thanks Andreas, still a Fluorite so much to be desired from my posts I am sure.

In theory they should be and you can use that assumption.

andreas_lds · Posted 03-29-2021 04:36 AM

@larsc wrote:

Thanks Andreas, still a Fluorite so much to be desired from my posts I am sure.

In theory they should be and you can use that assumption.

Fine. Then follow the links posted by @Patrick

Shmuel · Posted 03-29-2021 05:17 AM

You demonstrated your data as

My data looks like the following:

Table1(Have1):

Keyword Category

Hello | Hey Greeting

Canada | America North America

Football | Basketball Sports

Date Comment

01JAN2021 Hello, my name is larsc

01JAN2021 I live in Europe

01JAN2021 I like to watch football

My desired output would be like:

Category Comment

Greeting Hello, my name is larsc

I live in Europe

Sports I like to watch football

Few clarifications needed:

Are the bold words the variable names in your data (Keyword, Category, Date, Comment)
Can Keyword contain more then one keyword and if positive are they separated by '|' ?
and in case of match, shall they be written to different output observations?
It seems that by "matching" you mean that the keyword is present in the comment,
though maybe you need ignore case.

Suggestion:

1) create a dataset with each keyword and its category.

2) match by SQL on upcase(comment) contains upcase(keyword)
as in attached code:

data have1;
  length keyword $15 category $20;
  infile cards dlm=',' truncover;
  input keyword $ category $;
cards;
Hello, Greeting
Hey, Greeting
Canada, North America
America, North America
Football, Sports
Basketball, Sports
; run;
data have2;
  length comment $40;
  infile cards dlm='|' truncover;
  input Date date9.  comment $;
cards;
01Jan2021| Hello, my name is larsc
01Jan2021| I live in Europe
01Jan2021| I like to watch football
; run;
proc sql;
  create table want as
  select a.category, b.comment
  from have1 a right join have2 b
/*     on index(upcase(comment),upcase(strip(keyword))) > 0 */
    on upcase(b.comment) contains upcase(strip(a.keyword))
;quit;

larsc · Posted 03-29-2021 03:12 PM

Thanks for all the replies to this topic. Unfortunately none of the answers solved my issue satisfactorily, I create a new post with a more informative description.

Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Re: Loop over multiple datasets to perform text matching

Classroom Training Available!