UMAnalyst
Obsidian | Level 7

I have a huge file (>200,000 obs) containing street addresses. I need to clean these data as well as possible. SAS states that one should remove the unit, apt number, etc from the street address before geocoding. I plan on using the position and length output from PRXSUBSTR to extract the apartment number from the street address.

 

So, I can get this to work:

 

data _null_;
  patternID = prxparse('/(\d+)$/');


/* Use PRXSUBSTR to find the position and length of the string. */
call prxsubstr(patternID, '12345 CONFUSED ST 1100', position, length);

put position= length=;
run;

 

But, when I apply this code to the data (on 'ADDRESSVAR' $50):

data OUT;

set IN;

  patternID = prxparse('/(\d+)$/');

call prxsubstr(patternID, ADDRESSVAR, position, length);

put position= length=;
run;

 

 

I get position = 0 and length = 0 for each obs. What am I missing?

 

Thanks for your help.

 

 

ACCEPTED SOLUTION
jnvickery
Obsidian | Level 7

Hi,

 

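I think it's due to the length of the ADDRESSVAR variable in the IN data set. The $ anchor matches only at the end of the string, and because ADDRESSVAR is a $50 variable its value is padded with trailing blanks. The digits are therefore followed by blanks rather than by the end of the string, so the pattern never matches.

 

If you wrap ADDRESSVAR in the TRIM function within the CALL PRXSUBSTR, it will work.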

 

Try this:

 

data OUT;
set IN;
patternID = prxparse('/(\d+)$/');
call prxsubstr(patternID, trim(ADDRESSVAR), position, length);
put position= length=;
run;
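
To see the padding effect side by side, here is a minimal sketch rather than a definitive implementation. The data set names in_demo and out_demo and the variables pos_raw, len_raw, pos_trim, len_trim, unit and street are made up for illustration. It also assumes that a trailing run of digits really is a unit number, which will not hold for every address.

```sas
/* Hypothetical one-row stand-in for the IN data set */
data in_demo;
  length ADDRESSVAR $50;
  ADDRESSVAR = '12345 CONFUSED ST 1100';
run;

data out_demo;
  set in_demo;
  length unit $10 street $50;
  patternID = prxparse('/(\d+)$/');

  /* Untrimmed: the stored value carries trailing blanks, so $ never follows the digits */
  call prxsubstr(patternID, ADDRESSVAR, pos_raw, len_raw);

  /* Trimmed: the digits are now the last characters of the string */
  call prxsubstr(patternID, trim(ADDRESSVAR), pos_trim, len_trim);

  /* Split only when there is text in front of the trailing digits */
  if pos_trim > 1 then do;
    unit   = substr(ADDRESSVAR, pos_trim, len_trim);
    street = substr(ADDRESSVAR, 1, pos_trim - 1);
  end;
  else street = ADDRESSVAR;

  put pos_raw= len_raw= pos_trim= len_trim= unit= street=;
  drop patternID;
run;
```

For the sample row, the untrimmed call should report 0 and 0 as in the question, and the trimmed call position 19 and length 4.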

 

- John

UMAnalyst
Obsidian | Level 7

spot on...

 

thank you very much.

Patrick
Opal | Level 21

@UMAnalyst

Just throwing in another way to extract a substring from a string using a RegEx: a custom informat built with PROC FORMAT.

proc format;
  invalue $street_num
    's/^[^\d]*(\d+)\s*$/\1/oi' (regexpe) = _same_
    other=' '
    ;
run;

data sample;
  infile datalines truncover;
  input addressvar $char50.;
  length street_num $5.;
  street_num=input(addressvar,$street_num.);
  datalines;
AAAAAAAAAAAAAaAA aaa 40 
BBBB40
CCC40xx
40
  40
40xx
;
run; 

https://support.sas.com/resources/papers/proceedings12/245-2012.pdf

 

Patrick
Opal | Level 21

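@jnvickery @UMAnalyst

I'd change the code as below, using a RETAIN statement for patternID and calling PRXPARSE only when _n_=1, so the pattern ID is created once and reused instead of PRXPARSE being called in every iteration of the data step. (For a constant pattern like this one, the SAS documentation says the expression is compiled only once either way, so this is mainly tidy practice. It matters more when the pattern is built from a variable.)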

data OUT;
set IN;
retain patternID ;
if _n_=1 then patternID = prxparse('/(\d+)$/');
call prxsubstr(patternID, trim(ADDRESSVAR), position, length);
put position= length=;
run;
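
A variation on the same idea, offered only as a sketch: instead of trimming the variable, let the pattern absorb the trailing blanks with \s*$ and read the position of the capture group with PRXMATCH and CALL PRXPOSN. The data set names in_demo2 and out_demo2, the sample rows, and the variables pos, len and unit are assumptions for illustration.

```sas
/* Hypothetical sample rows; ADDRESSVAR is blank-padded to 50 characters */
data in_demo2;
  length ADDRESSVAR $50;
  ADDRESSVAR = '12345 CONFUSED ST 1100'; output;
  ADDRESSVAR = '77 NO UNIT RD';          output;
run;

data out_demo2;
  set in_demo2;
  length unit $10;
  retain patternID;
  if _n_ = 1 then patternID = prxparse('/(\d+)\s*$/');

  /* PRXMATCH returns the match position, or 0 when there is no match */
  if prxmatch(patternID, ADDRESSVAR) > 0 then do;
    /* Position and length of capture group 1 (the digits only) */
    call prxposn(patternID, 1, pos, len);
    unit = substr(ADDRESSVAR, pos, len);
  end;

  put ADDRESSVAR= pos= len= unit=;
  drop patternID;
run;
```

CALL PRXSUBSTR would report the length of the whole match here, trailing blanks included, which is why the sketch asks for the capture group instead.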
Jagadishkatam
Amethyst | Level 16

Hi ,

 

Sometimes the apartment number is not at the very end of the address but follows "ST". In that case the task is to extract the number that follows "ST". Please try the regular expression below:

data have;
input address $50.;
id=prxparse('/(\w\s\d+)/');
call prxsubstr(id,address,start,length);
put start= length=;
/* skip the leading word character and the space; keep only the digits */
if start>0 then new2=substr(address,start+2,length-2);
cards;
12345 CONFUSED ST 1100
;

 

\w matches a single word character (a letter, digit, or underscore), \s a whitespace character, and \d+ one or more digits. Because the pattern is not anchored, PRXSUBSTR reports the first place where a word character, a space, and digits occur in that order. For the sample address that is "T 1100" (start=17, length=6).

 

To extract the apartment number, use the SUBSTR function with the start and length variables. Add 2 to start and subtract 2 from length so that the leading character and the space are skipped and only the digits are extracted.
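
Where the unit number follows a known token such as ST, a capture group can return the digits directly and avoid offset arithmetic. The sketch below is one possible way to do that. The pattern, the data set name have_demo, the second sample row, and the variables len and unit are assumptions for illustration. Real address data has many street types (AVE, RD and so on) that a pattern like this would not cover.

```sas
data have_demo;
  length address $50 unit $10;
  retain id;
  /* Hypothetical pattern: the token ST, whitespace, then the digits to capture */
  if _n_ = 1 then id = prxparse('/\bST\s+(\d+)/');
  input address $char50.;

  if prxmatch(id, address) > 0 then do;
    /* Position and length of capture group 1 (the digits only) */
    call prxposn(id, 1, start, len);
    unit = substr(address, start, len);
  end;

  put address= start= len= unit=;
  drop id;
  datalines;
12345 CONFUSED ST 1100
12345 CONFUSED ST 1100 REAR
;
run;
```

Because the pattern is not anchored to the end of the string, trailing blanks or trailing text after the number do not prevent a match.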

 

Hope this helps.

 

Thanks,

Jag

 

 

