BookmarkSubscribeRSS Feed
larsc
Obsidian | Level 7

Hi SAS community,

 

I am performing text-matching using regular expression based on the values of two tables (simplified for demonstration purposes below). My issue is that I am very unsure about the best way to go about setting up my code for doing this check for matches between values of one table with values in another table.

 

My data looks like the following:

Table1 (Have1):

Keyword                          Category

Hello | Hey                       Greeting

Canada | America            North America

Football | Basketball        Sports

 

Date                        Comment

01JAN2021             Hello, my name is larsc

01JAN2021             I live in Europe

01JAN2021             I like to watch football

 

My desired output would be like:

Category     Comment

Greeting       Hello, my name is larsc

                     I live in Europe

Sports          I like to watch football

Code3   Test3

 

In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.

 

I have done very (very) little looping thus far in SAS as may be evident from my pseudocode below, and I am unsure if setting this up as a loop is the best approach for my problem.

 

In pseudocode, my current approach would be something like the following:

 

data want;
   set have1 have2;
   do over have2; *Loop through comments in second table
      do over have1; *Loop over first table to check for match between comment and keyword, if match assign value of category to result
          if prxmatch(keyword, comment) then result = category;
      end;
   end;
run;

 

Hopefully this demonstrates my issue sufficiently, but please let me know if you want me to expand on my problem. The coding is done within a User Written Code in Data Integration studio.

 

10 REPLIES 10
Patrick
Opal | Level 21

Should your dataset Want have as many rows as there are in Have2? In other words: Can there be more than one match? And what should happen if there is no match?

larsc
Obsidian | Level 7

Hi Patrick,

 

My Want dataset should have as many records as dataset Have2. If no match can be found, the record from dataset Have2 should be output with a missing Category value (i.e., no match found). Let me add a desired output table to illustrate:

 

Table1 (Have1):

Keyword   Category

Test1          Category1

Test2          Category2

Test3          Category3

 

Table2 (Have2):

Code     Comment

Code1   Test1

Code2   Test2

Code3   Test3

Code4   Test4

Code5   Test5

Code6   Test1

Code7   Test1

Code8   Test2

 

Desired output (Want):

Category     Comment

Category1    Test1

Category2    Test2

Category3    Test3

                     Test4

                     Test5

Category1    Test1

Category1    Test1

Category2    Test2

 

Patrick
Opal | Level 21

One way to go would be to create an informat with the RegEx like documented here or with more examples here

You then can add an OTHER = " " for any case with no match.

 

mkeintz
PROC Star

Your pseudo-data suggests you are doing exact matches between COMMENT in HAVE2 and KEYWORD in HAVE1.  So why are you using prxmatch instead of a straightforward join in proc sql?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
larsc
Obsidian | Level 7

Hey mkeintz,

 

Sorry, I should have been more precise in my description. I am not actually doing exact matching between my comments and keywords, my example is oversimplified. In reality, the data looks more like this:

 

Table1 (Have1):

Keyword                          Category

Hello | Hey                       Greeting

Canada | America            North America

Football | Basketball        Sports

 

Date                        Comment

01JAN2021             Hello, my name is larsc

01JAN2021             I live in Europe

01JAN2021             I like to watch football

 

Category     Comment

Greeting       Hello, my name is larsc

                     I live in Europe

Sports          I like to watch football

 

In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.

andreas_lds
Jade | Level 19

@larsc wrote:

Hey mkeintz,

 

Sorry, I should have been more precise in my description. I am not actually doing exact matching between my comments and keywords, my example is oversimplified. In reality, the data looks more like this:

 

Table1 (Have1):

Keyword                          Category

Hello | Hey                       Greeting

Canada | America            North America

Football | Basketball        Sports

 

Date                        Comment

01JAN2021             Hello, my name is larsc

01JAN2021             I live in Europe

01JAN2021             I like to watch football

 

Category     Comment

Greeting       Hello, my name is larsc

                     I live in Europe

Sports          I like to watch football

 

In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.


Yes, you should have posted this example in the initial post as working data step.

So before i start, are all words in "Keyword" unique? Meaning each word is only once in the dataset.

larsc
Obsidian | Level 7

Thanks Andreas, still a Fluorite so much to be desired from my posts I am sure.

 

In theory they should be and you can use that assumption.

andreas_lds
Jade | Level 19

@larsc wrote:

Thanks Andreas, still a Fluorite so much to be desired from my posts I am sure.

 

In theory they should be and you can use that assumption.


Fine. Then follow the links posted by @Patrick 

Shmuel
Garnet | Level 18

You demonstrated your data as 

 

My data looks like the following:

Table1(Have1):

Keyword                          Category

Hello | Hey                       Greeting

Canada | America            North America

Football | Basketball        Sports

 

Date                        Comment

01JAN2021             Hello, my name is larsc

01JAN2021             I live in Europe

01JAN2021             I like to watch football

 

My desired output would be like:

Category     Comment

Greeting       Hello, my name is larsc

                     I live in Europe

Sports          I like to watch football

Few clarifications needed:

  • Are the bold words the variable names in your data (Keyword, Category, Date, Comment)
  • Can Keyword contain more then one keyword and if positive are they separated by '|' ?
    and in case of match, shall they be written to different output observations?
  • It seems that by "matching" you mean that the keyword is present in the comment, 
    though maybe you need ignore case.

Suggestion:

1) create a dataset with each keyword and its category.

2) match by SQL on upcase(comment) contains upcase(keyword)
    as in attached code:

data have1;
  length keyword $15 category $20;
  infile cards dlm=',' truncover;
  input keyword $ category $;
cards;
Hello, Greeting
Hey, Greeting
Canada, North America
America, North America
Football, Sports
Basketball, Sports
; run;
data have2;
  length comment $40;
  infile cards dlm='|' truncover;
  input Date date9.  comment $;
cards;
01Jan2021| Hello, my name is larsc
01Jan2021| I live in Europe
01Jan2021| I like to watch football
; run;
proc sql;
  create table want as
  select a.category, b.comment
  from have1 a right join have2 b
/*     on index(upcase(comment),upcase(strip(keyword))) > 0 */
    on upcase(b.comment) contains upcase(strip(a.keyword))
;quit;
  

 

 

 

larsc
Obsidian | Level 7

Thanks for all the replies to this topic. Unfortunately none of the answers solved my issue satisfactorily, I create a new post with a more informative description.

sas-innovate-2024.png

 

Time is running out to save with the early bird rate. Register by Friday, March 1 for just $695 - $100 off the standard rate.

 

Check out the agenda and get ready for a jam-packed event featuring workshops, super demos, breakout sessions, roundtables, inspiring keynotes and incredible networking events. 

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 947 views
  • 0 likes
  • 5 in conversation