DATA Step, Macro, Functions and more

Joining/Matching on URL

Accepted Solution Solved
Reply
Frequent Contributor
Frequent Contributor
Posts: 93
Accepted Solution

Joining/Matching on URL

Hello folks,

 

I have a large table of URL info (and associated fields) I would like to categorize using a partial match to info found in the URL. Instead of writing a case statement to account for this, what would be the best method? 

 

I've come across a few cases where people are using index in the where clause to match but that's not working 100% for me.

 

Here is some example data.

data test;
infile datalines dsd truncover;
input
match:$200.
field1:$200.
field2:$200.
;

datalines;
'test', 'val1', 'val2'
'testing', 'val3', 'val4'
'tested', 'val5', 'val6'
;;;


data test1;
infile datalines dsd truncover;
input
url:$200.
;

datalines;
'www.facebook.com/test/fsdfne'
'www.facebook.com/testing/deeheheksjbd'
'www.facebook.com/tested/5wnf4wljn/4rwef'
;

 

If I try some code like what's below Im getting a cartesian product. I only want to match on the exact characters from 'test.match'. How can I accomplish this?

 

As always, appreciate any help! 

 

 

proc sql;

create table URL_Match as

select 

a.url,

b.match,

b.field1,

b.field2

from test1 a, test b

where index(upper(a.url),strip(upper(b.match))) > 0;

quit;

 

 

 

 

 

 

 


Accepted Solutions
Solution
‎05-14-2018 10:29 PM
Respected Advisor
Posts: 4,668

Re: Joining/Matching on URL

@Ody

If you want to use characters with special meaning as normal characters then mask them with a backward slash. For the example you've posted that would look as below:

  prxmatch('/home\.synchrony\.com\/summer/',<your variable>)

If you don't need to define patterns like in your sample string (which is just a constant string) then I'd use index() of find() as these functions are much less resource hungry.

View solution in original post


All Replies
Super User
Posts: 13,293

Re: Joining/Matching on URL

Can you show what you want the output to look like.

What actual role do field1 and field2 play in the output?

How many sets of these values are you going to need to look for?

Is there a particular reason you are defining your variables match, field1 and field2 as 200 characters?

 

You might look into FINDW instead of Index in this case as "test" will match "testing" and "tested". Findw allows setting one or more delimiters so only words that match for the entire length

 

proc sql;
   create table URL_Match as
   select 
   a.url,
   b.match,
   b.field1,
   b.field2
   from test1 a , 
         test as b
   where findw(upcase(a.url),strip(upcase(b.match)),'/') > 0;
quit;
Frequent Contributor
Frequent Contributor
Posts: 93

Re: Joining/Matching on URL

results.jpg

 

 

 

 

 

Results look like this.

 

New Contributor
Posts: 2

Re: Joining/Matching on URL

Hi Ody,

 

I recommend to use regular expressions,if you want to search/substitue complex patterns in strings (like url's).

 

SAS provides the prxmatch function for such tasks (for further informations I recommend the SAS help)

 

proc sql;

create table URL_Match as
select
a.url,
b.match,
b.field1,
b.field2
from test1 as a
inner join test as b
on  prxmatch("/\/"!!strip(b.match)!!"\//",a.url)>0;
quit;

 

Regular expressions are very powerful/standarized and available in a lot of modern programming languages.

 

You can find a lot of tutorials on the web, I strongly recommend to take a look.

 

Frequent Contributor
Frequent Contributor
Posts: 93

Re: Joining/Matching on URL

Thanks for the feedback and potential solutions.

 

Ultimately, I need to rethink my approach to this problem. The URLs I'm trying to match against often contain multiple values (keywords) from my lookup list so I need to rethink how those lookup values get prioritized, otherwise I'm still gonna end up with multiple matches for one URL.

Frequent Contributor
Frequent Contributor
Posts: 93

Re: Joining/Matching on URL

So, after some more research today I think that PRXMATCH is the way to go however I'm not sure how to account for metacharacters in my URL strings.

 

For example, when I try to match on this URL i get an error: prxmatch('/home.synchrony.com/summer/')

 

I have tried changing this to prxmatch('/home\.synchrony\.com\/summer/') to count the period and forward slash as regular characters but it's not working well for me. I'm not sure what I'm doing wrong. Thoughts?

Solution
‎05-14-2018 10:29 PM
Respected Advisor
Posts: 4,668

Re: Joining/Matching on URL

@Ody

If you want to use characters with special meaning as normal characters then mask them with a backward slash. For the example you've posted that would look as below:

  prxmatch('/home\.synchrony\.com\/summer/',<your variable>)

If you don't need to define patterns like in your sample string (which is just a constant string) then I'd use index() of find() as these functions are much less resource hungry.

Frequent Contributor
Frequent Contributor
Posts: 93

Re: Joining/Matching on URL

[ Edited ]

Thanks for the feedback.

 

Unfortunately I need explicit definitions for my my string search. I've tried find/index, and their variations, to no avail. I had some success with PRXMatch but kept running into issues with the metacharacters. 

 

I'll give this a go tomorrow, thanks again.

 

 

edit: Thanks for the help. Solution verified.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 148 views
  • 0 likes
  • 4 in conversation