BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ody
Quartz | Level 8 Ody
Quartz | Level 8

Hello folks,

 

I have a large table of URL info (and associated fields) I would like to categorize using a partial match to info found in the URL. Instead of writing a case statement to account for this, what would be the best method? 

 

I've come across a few cases where people are using index in the where clause to match but that's not working 100% for me.

 

Here is some example data.

data test;
infile datalines dsd truncover;
input
match:$200.
field1:$200.
field2:$200.
;

datalines;
'test', 'val1', 'val2'
'testing', 'val3', 'val4'
'tested', 'val5', 'val6'
;;;


data test1;
infile datalines dsd truncover;
input
url:$200.
;

datalines;
'www.facebook.com/test/fsdfne'
'www.facebook.com/testing/deeheheksjbd'
'www.facebook.com/tested/5wnf4wljn/4rwef'
;

 

If I try some code like what's below Im getting a cartesian product. I only want to match on the exact characters from 'test.match'. How can I accomplish this?

 

As always, appreciate any help! 

 

 

proc sql;

create table URL_Match as

select 

a.url,

b.match,

b.field1,

b.field2

from test1 a, test b

where index(upper(a.url),strip(upper(b.match))) > 0;

quit;

 

 

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@Ody

If you want to use characters with special meaning as normal characters then mask them with a backward slash. For the example you've posted that would look as below:

  prxmatch('/home\.synchrony\.com\/summer/',<your variable>)

If you don't need to define patterns like in your sample string (which is just a constant string) then I'd use index() of find() as these functions are much less resource hungry.

View solution in original post

7 REPLIES 7
ballardw
Super User

Can you show what you want the output to look like.

What actual role do field1 and field2 play in the output?

How many sets of these values are you going to need to look for?

Is there a particular reason you are defining your variables match, field1 and field2 as 200 characters?

 

You might look into FINDW instead of Index in this case as "test" will match "testing" and "tested". Findw allows setting one or more delimiters so only words that match for the entire length

 

proc sql;
   create table URL_Match as
   select 
   a.url,
   b.match,
   b.field1,
   b.field2
   from test1 a , 
         test as b
   where findw(upcase(a.url),strip(upcase(b.match)),'/') > 0;
quit;
Ody
Quartz | Level 8 Ody
Quartz | Level 8

results.jpg

 

 

 

 

 

Results look like this.

 

Guardian
Calcite | Level 5

Hi Ody,

 

I recommend to use regular expressions,if you want to search/substitue complex patterns in strings (like url's).

 

SAS provides the prxmatch function for such tasks (for further informations I recommend the SAS help)

 

proc sql;

create table URL_Match as
select
a.url,
b.match,
b.field1,
b.field2
from test1 as a
inner join test as b
on  prxmatch("/\/"!!strip(b.match)!!"\//",a.url)>0;
quit;

 

Regular expressions are very powerful/standarized and available in a lot of modern programming languages.

 

You can find a lot of tutorials on the web, I strongly recommend to take a look.

 

Ody
Quartz | Level 8 Ody
Quartz | Level 8

Thanks for the feedback and potential solutions.

 

Ultimately, I need to rethink my approach to this problem. The URLs I'm trying to match against often contain multiple values (keywords) from my lookup list so I need to rethink how those lookup values get prioritized, otherwise I'm still gonna end up with multiple matches for one URL.

Ody
Quartz | Level 8 Ody
Quartz | Level 8

So, after some more research today I think that PRXMATCH is the way to go however I'm not sure how to account for metacharacters in my URL strings.

 

For example, when I try to match on this URL i get an error: prxmatch('/home.synchrony.com/summer/')

 

I have tried changing this to prxmatch('/home\.synchrony\.com\/summer/') to count the period and forward slash as regular characters but it's not working well for me. I'm not sure what I'm doing wrong. Thoughts?

Patrick
Opal | Level 21

@Ody

If you want to use characters with special meaning as normal characters then mask them with a backward slash. For the example you've posted that would look as below:

  prxmatch('/home\.synchrony\.com\/summer/',<your variable>)

If you don't need to define patterns like in your sample string (which is just a constant string) then I'd use index() of find() as these functions are much less resource hungry.

Ody
Quartz | Level 8 Ody
Quartz | Level 8

Thanks for the feedback.

 

Unfortunately I need explicit definitions for my my string search. I've tried find/index, and their variations, to no avail. I had some success with PRXMatch but kept running into issues with the metacharacters. 

 

I'll give this a go tomorrow, thanks again.

 

 

edit: Thanks for the help. Solution verified.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1247 views
  • 0 likes
  • 4 in conversation