Hello folks,
I have a large table of URL info (and associated fields) I would like to categorize using a partial match to info found in the URL. Instead of writing a case statement to account for this, what would be the best method?
I've come across a few cases where people are using index in the where clause to match but that's not working 100% for me.
Here is some example data.
data test; infile datalines dsd truncover; input match:$200. field1:$200. field2:$200. ; datalines; 'test', 'val1', 'val2' 'testing', 'val3', 'val4' 'tested', 'val5', 'val6' ;;;
data test1;
infile datalines dsd truncover;
input
url:$200.
;
datalines;
'www.facebook.com/test/fsdfne'
'www.facebook.com/testing/deeheheksjbd'
'www.facebook.com/tested/5wnf4wljn/4rwef'
;
If I try some code like what's below Im getting a cartesian product. I only want to match on the exact characters from 'test.match'. How can I accomplish this?
As always, appreciate any help!
proc sql;
create table URL_Match as
select
a.url,
b.match,
b.field1,
b.field2
from test1 a, test b
where index(upper(a.url),strip(upper(b.match))) > 0;
quit;
If you want to use characters with special meaning as normal characters then mask them with a backward slash. For the example you've posted that would look as below:
prxmatch('/home\.synchrony\.com\/summer/',<your variable>)
If you don't need to define patterns like in your sample string (which is just a constant string) then I'd use index() of find() as these functions are much less resource hungry.
Can you show what you want the output to look like.
What actual role do field1 and field2 play in the output?
How many sets of these values are you going to need to look for?
Is there a particular reason you are defining your variables match, field1 and field2 as 200 characters?
You might look into FINDW instead of Index in this case as "test" will match "testing" and "tested". Findw allows setting one or more delimiters so only words that match for the entire length
proc sql; create table URL_Match as select a.url, b.match, b.field1, b.field2 from test1 a , test as b where findw(upcase(a.url),strip(upcase(b.match)),'/') > 0; quit;
Results look like this.
Hi Ody,
I recommend to use regular expressions,if you want to search/substitue complex patterns in strings (like url's).
SAS provides the prxmatch function for such tasks (for further informations I recommend the SAS help)
proc sql;
create table URL_Match as
select
a.url,
b.match,
b.field1,
b.field2
from test1 as a
inner join test as b
on prxmatch("/\/"!!strip(b.match)!!"\//",a.url)>0;
quit;
Regular expressions are very powerful/standarized and available in a lot of modern programming languages.
You can find a lot of tutorials on the web, I strongly recommend to take a look.
Thanks for the feedback and potential solutions.
Ultimately, I need to rethink my approach to this problem. The URLs I'm trying to match against often contain multiple values (keywords) from my lookup list so I need to rethink how those lookup values get prioritized, otherwise I'm still gonna end up with multiple matches for one URL.
So, after some more research today I think that PRXMATCH is the way to go however I'm not sure how to account for metacharacters in my URL strings.
For example, when I try to match on this URL i get an error: prxmatch('/home.synchrony.com/summer/')
I have tried changing this to prxmatch('/home\.synchrony\.com\/summer/') to count the period and forward slash as regular characters but it's not working well for me. I'm not sure what I'm doing wrong. Thoughts?
If you want to use characters with special meaning as normal characters then mask them with a backward slash. For the example you've posted that would look as below:
prxmatch('/home\.synchrony\.com\/summer/',<your variable>)
If you don't need to define patterns like in your sample string (which is just a constant string) then I'd use index() of find() as these functions are much less resource hungry.
Thanks for the feedback.
Unfortunately I need explicit definitions for my my string search. I've tried find/index, and their variations, to no avail. I had some success with PRXMatch but kept running into issues with the metacharacters.
I'll give this a go tomorrow, thanks again.
edit: Thanks for the help. Solution verified.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.