Hi SAS community,
I am performing text-matching using regular expression based on the values of two tables (simplified for demonstration purposes below). My issue is that I am very unsure about the best way to go about setting up my code for doing this check for matches between values of one table with values in another table.
My data looks like the following:
Table1 (Have1):
Keyword Category
Hello | Hey Greeting
Canada | America North America
Football | Basketball Sports
Date Comment
01JAN2021 Hello, my name is larsc
01JAN2021 I live in Europe
01JAN2021 I like to watch football
My desired output would be like:
Category Comment
Greeting Hello, my name is larsc
I live in Europe
Sports I like to watch football
Code3 Test3
In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.
I have done very (very) little looping thus far in SAS as may be evident from my pseudocode below, and I am unsure if setting this up as a loop is the best approach for my problem.
In pseudocode, my current approach would be something like the following:
data want; set have1 have2; do over have2; *Loop through comments in second table do over have1; *Loop over first table to check for match between comment and keyword, if match assign value of category to result if prxmatch(keyword, comment) then result = category; end; end; run;
Hopefully this demonstrates my issue sufficiently, but please let me know if you want me to expand on my problem. The coding is done within a User Written Code in Data Integration studio.
Should your dataset Want have as many rows as there are in Have2? In other words: Can there be more than one match? And what should happen if there is no match?
Hi Patrick,
My Want dataset should have as many records as dataset Have2. If no match can be found, the record from dataset Have2 should be output with a missing Category value (i.e., no match found). Let me add a desired output table to illustrate:
Table1 (Have1):
Keyword Category
Test1 Category1
Test2 Category2
Test3 Category3
Table2 (Have2):
Code Comment
Code1 Test1
Code2 Test2
Code3 Test3
Code4 Test4
Code5 Test5
Code6 Test1
Code7 Test1
Code8 Test2
Desired output (Want):
Category Comment
Category1 Test1
Category2 Test2
Category3 Test3
Test4
Test5
Category1 Test1
Category1 Test1
Category2 Test2
Your pseudo-data suggests you are doing exact matches between COMMENT in HAVE2 and KEYWORD in HAVE1. So why are you using prxmatch instead of a straightforward join in proc sql?
Hey mkeintz,
Sorry, I should have been more precise in my description. I am not actually doing exact matching between my comments and keywords, my example is oversimplified. In reality, the data looks more like this:
Table1 (Have1):
Keyword Category
Hello | Hey Greeting
Canada | America North America
Football | Basketball Sports
Date Comment
01JAN2021 Hello, my name is larsc
01JAN2021 I live in Europe
01JAN2021 I like to watch football
Category Comment
Greeting Hello, my name is larsc
I live in Europe
Sports I like to watch football
In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.
@larsc wrote:
Hey mkeintz,
Sorry, I should have been more precise in my description. I am not actually doing exact matching between my comments and keywords, my example is oversimplified. In reality, the data looks more like this:
Table1 (Have1):
Keyword Category
Hello | Hey Greeting
Canada | America North America
Football | Basketball Sports
Date Comment
01JAN2021 Hello, my name is larsc
01JAN2021 I live in Europe
01JAN2021 I like to watch football
Category Comment
Greeting Hello, my name is larsc
I live in Europe
Sports I like to watch football
In reality the comments are much longer than the ones given above in which the comments are written by engineers and the keywords are highly specific to the discipline.
Yes, you should have posted this example in the initial post as working data step.
So before i start, are all words in "Keyword" unique? Meaning each word is only once in the dataset.
Thanks Andreas, still a Fluorite so much to be desired from my posts I am sure.
In theory they should be and you can use that assumption.
You demonstrated your data as
My data looks like the following:
Table1(Have1):
Keyword Category
Hello | Hey Greeting
Canada | America North America
Football | Basketball Sports
Date Comment
01JAN2021 Hello, my name is larsc
01JAN2021 I live in Europe
01JAN2021 I like to watch football
My desired output would be like:
Category Comment
Greeting Hello, my name is larsc
I live in Europe
Sports I like to watch football
Few clarifications needed:
Suggestion:
1) create a dataset with each keyword and its category.
2) match by SQL on upcase(comment) contains upcase(keyword)
as in attached code:
data have1;
length keyword $15 category $20;
infile cards dlm=',' truncover;
input keyword $ category $;
cards;
Hello, Greeting
Hey, Greeting
Canada, North America
America, North America
Football, Sports
Basketball, Sports
; run;
data have2;
length comment $40;
infile cards dlm='|' truncover;
input Date date9. comment $;
cards;
01Jan2021| Hello, my name is larsc
01Jan2021| I live in Europe
01Jan2021| I like to watch football
; run;
proc sql;
create table want as
select a.category, b.comment
from have1 a right join have2 b
/* on index(upcase(comment),upcase(strip(keyword))) > 0 */
on upcase(b.comment) contains upcase(strip(a.keyword))
;quit;
Thanks for all the replies to this topic. Unfortunately none of the answers solved my issue satisfactorily, I create a new post with a more informative description.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.