Regex to find position of string

I have a large file (>200,000 obs) containing street addresses, and I need to clean these data as well as possible. SAS states that one should remove the unit, apartment number, etc. from the street address before geocoding. I plan to use the position and length returned by CALL PRXSUBSTR to extract the apartment number from the street address.

 

So, I can get this to work:

 

data _null_;
  patternID = prxparse('/(\d+)$/');

  /* Use PRXSUBSTR to find the position and length of the string. */
  call prxsubstr(patternID, '12345 CONFUSED ST 1100', position, length);

  put position= length=;
run;

 

But, when I apply this code to the data (on 'ADDRESSVAR' $50):

data OUT;
  set IN;
  patternID = prxparse('/(\d+)$/');
  call prxsubstr(patternID, ADDRESSVAR, position, length);
  put position= length=;
run;

 

 

I get position = 0 and length = 0 for each obs. What am I missing?

 

Thanks for your help.

 

 


Accepted Solutions
Solution
‎10-12-2015 12:33 PM
Occasional Contributor
Posts: 15

Re: Regex to find position of string

Hi,

 

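I think it's due to the length of the ADDRESSVAR variable in the IN data set. ADDRESSVAR is a fixed-length $50 variable, so shorter values are padded with trailing blanks. The $ anchor only matches at the end of the string, and in the padded value the digits are followed by blanks rather than by the end of the string, so the pattern never matches.

If you wrap ADDRESSVAR in the TRIM function within the CALL PRXSUBSTR, it will work.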

 

Try this:

 

data OUT;
set IN;
patternID = prxparse('/(\d+)$/');
call prxsubstr(patternID, trim(ADDRESSVAR), position, length);
put position= length=;
run;
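
To see the padding effect in isolation, here is a minimal sketch. It assumes a one-row stand-in for the IN data set, and the names id_end, id_blanks, pos_* and len_* are made up for illustration. It also tries an alternative pattern, /(\d+)\s*$/, that tolerates trailing blanks instead of trimming the value first.

```sas
/* Sketch only: a one-row stand-in for the IN data set (hypothetical value) */
data _null_;
  length addressvar $50;
  addressvar = '12345 CONFUSED ST 1100';   /* stored padded with blanks to 50 characters */

  id_end    = prxparse('/(\d+)$/');        /* digits must touch the end of the string */
  id_blanks = prxparse('/(\d+)\s*$/');     /* digits may be followed by blanks */

  call prxsubstr(id_end,    addressvar,       pos_raw,  len_raw);   /* untrimmed value */
  call prxsubstr(id_end,    trim(addressvar), pos_trim, len_trim);  /* trimmed value */
  call prxsubstr(id_blanks, addressvar,       pos_pad,  len_pad);   /* blank-tolerant pattern */

  /* len_pad counts the trailing blanks too, because \s* is part of the match */
  put pos_raw= len_raw= / pos_trim= len_trim= / pos_pad= len_pad=;
run;
```

The first call should reproduce the position=0 length=0 result from the question, while the other two should find the digits. Note that with the \s* pattern the returned length covers the trailing blanks as well, so TRIM is the simpler choice if you want the length of the digits alone.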

 

- John


All Replies
Occasional Contributor
Posts: 14

Re: Regex to find position of string

spot on...

 

thank you very much.

Respected Advisor
Posts: 3,908

Re: Regex to find position of string

@UMAnalyst

Just throwing in another way to extract a substring from a string using a RegEx: an informat defined with PROC FORMAT.

proc format;
  invalue $street_num
    's/^[^\d]*(\d+)\s*$/\1/oi' (regexpe) = _same_
    other=' '
    ;
run;

data sample;
  infile datalines truncover;
  input addressvar $char50.;
  length street_num $5.;
  street_num=input(addressvar,$street_num.);
  datalines;
AAAAAAAAAAAAAaAA aaa 40 
BBBB40
CCC40xx
40
  40
40xx
;
run; 

https://support.sas.com/resources/papers/proceedings12/245-2012.pdf

 

Respected Advisor
Posts: 3,908

Re: Regex to find position of string

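@jnvickery @UMAnalyst

I'd suggest changing the code as below, using a RETAIN statement for patternID and calling PRXPARSE only when _N_=1. As I read the SAS documentation, a constant pattern like this one is compiled only once either way, so here the change mainly makes the intent explicit and avoids a redundant function call in every iteration of the data step. It matters much more when the pattern is built from a variable, because then a new version of the RegEx can get compiled in every single iteration, which is unnecessary, inefficient, and clutters memory.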

data OUT;
set IN;
retain patternID ;
if _n_=1 then patternID = prxparse('/(\d+)$/');
call prxsubstr(patternID, trim(ADDRESSVAR), position, length);
put position= length=;
run;
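
Building on this, here is a sketch of how the match could be used to actually split off the apartment number, which was the goal in the original question. It is an illustration under assumptions, not a tested cleaning routine: addr_sample, addr_clean, apt_num, street, pos and len are hypothetical names, and the pattern /\s+(\d+)\s*$/ treats any trailing number as a unit number, which will be wrong for addresses such as "PO BOX 123". It uses PRXMATCH followed by CALL PRXPOSN to get the position and length of the capture group, so no TRIM is needed.

```sas
/* Hypothetical sample data standing in for the IN data set */
data addr_sample;
  infile datalines truncover;
  input addressvar $char50.;
  datalines;
12345 CONFUSED ST 1100
77 MAIN ST
;
run;

data addr_clean;
  set addr_sample;
  length apt_num $10 street $50;
  retain patternID;

  /* whitespace, then the trailing digits (group 1), then optional blanks */
  if _n_ = 1 then patternID = prxparse('/\s+(\d+)\s*$/');

  if prxmatch(patternID, addressvar) then do;
    /* position and length of capture group 1, i.e. the digits only */
    call prxposn(patternID, 1, pos, len);
    apt_num = substr(addressvar, pos, len);
    street  = substr(addressvar, 1, pos - 1);
  end;
  else street = addressvar;   /* no trailing number: keep the address unchanged */

  drop patternID pos len;
run;
```

Because the pattern requires whitespace before the digits, pos is always greater than 1 when there is a match, so the SUBSTR call for street never gets a zero length.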
Trusted Advisor
Posts: 1,131

Re: Regex to find position of string

Hi,

Sometimes the apartment number is not at the very end of the address but follows a street-type word such as "ST". In that case the task is to extract the number that follows "ST", so please try the regular expression below:

data have;
  input address $50.;
  /* a word character, one whitespace character, then one or more digits */
  id = prxparse('/(\w\s\d+)/');
  call prxsubstr(id, address, start, length);
  put start= length=;
  /* skip the leading word character and the space; guard against no match (start=0) */
  if start > 0 then new2 = substr(address, start+2, length-2);
  cards;
12345 CONFUSED ST 1100
;

The \w matches a single word character (a letter, digit, or underscore), \s matches the space after it, and \d+ matches the digits that follow. This gives another way to recognise the apartment number.

When extracting the apartment number we use the SUBSTR function with the start and length variables. The match begins at the word character, so add 2 to start and subtract 2 from length; that skips the letter and the space and extracts only the digits.

 

Hope this helps.

 

Thanks,

Jag

 

 
