DATA Step, Macro, Functions and more

Optimising Regular expressions when using multiple arrays

Accepted Solution Solved
Reply
Contributor
Posts: 26
Accepted Solution

Optimising Regular expressions when using multiple arrays

Morning World,

 

i am looking to optimise my code by reducing the statement below by removing the additional array match statement(/* before city name street match*/), which i believe can be put into another (/* DETERMINE IF ITS A MATCH ON A STREET/ROAD/AVENUE NAME*/) so that it can all be done on the STREET MATCH variable rather than that and BEFORE MATCH

 

The final result gives me what i need - identify an actual city, identify when its actually a street (followed by STREET/ Road etc) and taking into consideration when the street name is before (RUE DE PARIS , Swaziland for example)

 

The code works, but is much slower by adding in the regex before city name clause (worked fast on just city name street clause) - which i expect, but i want to obviously make this optimal. Also i expect more examples than rue de ... so i will need to build this in better also

 

%let rue = rue de;

data CITY_FINAL

array regex (&num_recs) $200 _temporary_;

array CITY_ARRAY (&num_recs) $200 _temporary_;

array regex_before (&num_recs) $200 _temporary_;

* load cities names into list;

if _n_=1 then

do p=1 to nobs;

set CITIES point=p nobs=nobs;

* Build REGEX for this "City_name with street/road/avenue";

regex(p)=cats("/\b(",City_name,")( st| ro| av)/i");

/* Build CITY_ARRAY for "City_name" */

CITY_ARRAY(p)=City_name;

/* Build array for instances where the street name is before the city name i.e rue de "City_name" .*/

regex_before(p)=cats("/\b(&rue )(",City_name,")/i");

 

end;

* keep matches;

set test_data;

/* DETERMINE IF ITS A MATCH ON A STREET/ROAD/AVENUE NAME*/

STREET_MATCH = 'N';

found=0;

do i=1 to nobs until(found);

found=prxmatch(regex(i),ADDRESS);

end;

if found THEN

DO;

STREET_MATCH='Y';

end;

/*ADD A MARKER TO IDENTIFY IF THERE IS A MATCH TO THE LISTED CITITES*/

CITY_MATCH = 'N';

/*RUN THROUGH THE ARRAY*/

DO j=1 TO dim(CITY_array);

/*CHECK AGAINST ADDRESS FIELDS FOR MATCH - IS THE CITY PART OF THE ADDRESS FIELD. USE UPPER CASE ON BOTH VARIABLES */

IF indexw(upcase(ADDRESS), upcase(CITY_ARRAY[j])) THEN

DO;

CITY_MATCH='Y';

LEAVE;

END;

end;

/* before city name street match*/

before_MATCH = 'N';

found2=0;

do k=1 to nobs until(found2);

found2=prxmatch(regex_before(k),ADDRESS);

end;

if found2 THEN

DO;

before_MATCH='Y';

end;

run;

 

TEST SAMPLES IF IT HELPS

 

DATA CITIES;

input City_name $char20.;

datalines;

KINGSLANDING

MORDOR

RIVENDELL

WINTERFELL

;

run;

DATA test_data;

input ADDRESS $char80.;

datalines;

HAMBURG

KINGSLANDING

GENEVA

paris

birmingham

Zurich

SINGAPORE

MORDOR - CAN YOU STILL IDENTIFY ME?

RIVENDELL 01 STILL TESTING WTH ADD TXT

TEST MORDOR ST

247 WINTERFELL STREET

MORDOR st

sMORDOR st

MORRDOR road

rue de RIVENDELL

;

run;


Accepted Solutions
Solution
‎01-16-2017 04:32 AM
Contributor
Posts: 26

Re: Optimising Regular expressions when using multiple arrays

[ Edited ]

this seems to work , but looks slight different - everything look ok from your POV? @gamotte

 

regex(p)=cats("/\b(&rue )(",City_name,")|\b(",City_name,")( st| ro| av)/i");

View solution in original post


All Replies
Regular Contributor
Posts: 222

Re: Optimising Regular expressions when using multiple arrays

Here is an idea you can try,

 

Replace your Arrays with Hash Objects, and Hash Iterators http://www.lexjansen.com/search/searchresults.php?q=hash%20Object

The Hash Object will provide you with functions to add(), and find(), and this will avoid/replace your Array Looping.

 

Just a different programming approach, hope it helps,

Ahmed

Regular Contributor
Posts: 240

Re: Optimising Regular expressions when using multiple arrays

Hello,


regex(p)=cats("/\b(&rue ",City_name,"|",City_name,"( st| ro| av))/i");
Contributor
Posts: 26

Re: Optimising Regular expressions when using multiple arrays

Hi @gamotte ,

 

I did actually come across this too, however it did not solve my issue as the Rue de Rivendell still is not identifiedSmiley Sadattached results)

 

performance wise it runs great, as it did without the before city name array and do loop before, but unfortunately did not identify Rue de

 

 


rivendell.PNG
Solution
‎01-16-2017 04:32 AM
Contributor
Posts: 26

Re: Optimising Regular expressions when using multiple arrays

[ Edited ]

this seems to work , but looks slight different - everything look ok from your POV? @gamotte

 

regex(p)=cats("/\b(&rue )(",City_name,")|\b(",City_name,")( st| ro| av)/i");

Regular Contributor
Posts: 240

Re: Optimising Regular expressions when using multiple arrays

I have left work si I can not test but I suspect my answer does not work because of some blanks. Try to remove the leading blank in the définition of &rue.
Regular Contributor
Posts: 240

Re: Optimising Regular expressions when using multiple arrays

The cats function removed the space after &rue. This should work :

 

regex(p)=cats("/\b(&rue.\s",City_name,"|",City_name,"( st| ro| av))/i");

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 208 views
  • 1 like
  • 3 in conversation