Greetings. I'm working on some addresses (thank you Art and Tom), that I am needing to parse and split into columns that can be joined to a DB2 table. I have one column in a spreadsheet containing the address, and I need to pull out the house number, the cardinal direction if present, and the street name. So, for example, if the address is '100 W Oak St', I need to pull out 100, 'W', and 'Oak'. I do not need the 'St', because it won't match what is the database in most cases, so I just want to omit it. I am parsing the address into an array, then pulling out the values based on some simple rules. I am good to go up until I need to stop building the street name at 'St' or 'Rd', etc. My thought was that I could select all the street names into a variable, then use it in a list to stop the loop. However, I also need to include all instances of street names appended with a '.' and a ',' since these are often found following the street abbreviation. The attached file contains the code, and hopefully adequate comments. Dataset 'Step1' is just what I need, but with the 'Rd', 'St', etc lopped off the end. Also, I realize streets like 'St Asaph', like is in the data will cause a problem to logic looking for a street abbreviation, and that is okay. I'm not looking for a 100% match to DB2, and since those are not common, I'm okay with leaving them blank.
Thank you
Greg
If your addresses are really that clean, I would just approach the problem differently. How about something like?:
data want (drop=_:);
set addresses;
length AD_HSE_NO 8.; *AD_HSE_NO is the database column for house number;
length AD_CRDNL_DIR $2.; *AD_CRDNL_DIR is the database column for direction;
length AD_STR_NM $50.; *AD_STR_NM is the database name for street name;
CALL SCAN(address, 2, _position, _length );
AD_HSE_NO=input(substr(address,1,_position-2),12.);
AD_STR_NM=strip(translate(substr(address,_position)," ","."));
if substr(AD_STR_NM,length(AD_STR_NM)-1) in (" N"," S"," E"," W") then do;
AD_CRDNL_DIR=substr(AD_STR_NM,length(AD_STR_NM)-1);
AD_STR_NM=strip(substr(AD_STR_NM,1,length(AD_STR_NM)-2));
end;
else if substr(AD_STR_NM,1,2) in ("N ","S ","E ","W ") then do;
AD_CRDNL_DIR=substr(AD_STR_NM,1,1);
AD_STR_NM=strip(substr(AD_STR_NM,3));
end;
CALL SCAN(AD_STR_NM, -1, _position, _length );
AD_STR_NM=strip(substr(AD_STR_NM,1,_position-1));
run;
Thank you Arthur. And therein lies the problem, they are not really that clean, I simply pulled out some of the easy ones to start with. For example, an address might be '100 Oak St behind garage'. The street abbreviation is not always in the last position, or I could do like you have done. I did not know about SCAN, I'll read up on that. Also, I put the array in there just so I could see what was happening, but I got rid of it now so it all happens in one pass. I'm still stuck on an easy way to stop at the street abbreviation. Thank you so much for your good time and effort.
Greg
this is what you want?
data Addresses;
input Address & $100.;
datalines;
200 Cameron Station Blvd
25 W Reed Ave
2701 Commonwealth Ave
5000 Echols Ave
3001 N Beauregard St
5650 Sanger Ave
5651 Rayburn Ave
530 S St Asaph St
215 Las Gaviotas Blvd.
4032 Bunch Walnuts Road
201 Stadium Dr
1116 Hillwell Rd
640 Battlefield Blvd S
298 Cedar Rd
212 Holt Dr
1280 Bells Mill Rd
1200 Cedar Rd
2901 Margaret Booker Dr
509 George Washington Hwy N
;
run;
data test;
set addresses;
n=max(find(address,'Blvd'),find(address,'Ave'),find(address,'St'),find(address,'Road'),find(address,'Dr'),find(address,'Hwy'),find(address,'Rd'));
number=scan(address,1,' ');
add=substr(address,anyalpha(address),n-length(number)-2);
drop n;
proc print;run;
Obs Address number add
1 200 Cameron Station Blvd 200 Cameron Station
2 25 W Reed Ave 25 W Reed
3 2701 Commonwealth Ave 2701 Commonwealth
4 5000 Echols Ave 5000 Echols
5 3001 N Beauregard St 3001 N Beauregard
6 5650 Sanger Ave 5650 Sanger
7 5651 Rayburn Ave 5651 Rayburn
8 530 S St Asaph St 530 S
9 215 Las Gaviotas Blvd. 215 Las Gaviotas
10 4032 Bunch Walnuts Road 4032 Bunch Walnuts
11 201 Stadium Dr 201 Stadium
12 1116 Hillwell Rd 1116 Hillwell
13 640 Battlefield Blvd S 640 Battlefield
14 298 Cedar Rd 298 Cedar
15 212 Holt Dr 212 Holt
16 1280 Bells Mill Rd 1280 Bells Mill
17 1200 Cedar Rd 1200 Cedar
18 2901 Margaret Booker Dr 2901 Margaret Booker
19 509 George Washington Hwy N 509 George Washington
Linlin, that's close, but I need to have the cardinal direction in its own colum, as I have in my example. What I am trying to avoid is hard coding all the possible values of street abbreviations. Thank you.
You could add Linlin's suggestion onto the code I proposed. Probably still won't solve all of your problems and, yes, would have to be adapted to include all of the street type words and abbreviations that exist in your data, but should get you close:
data want (drop=_:);
set addresses;
length AD_HSE_NO 8.; *AD_HSE_NO is the database column for house number;
length AD_CRDNL_DIR $2.; *AD_CRDNL_DIR is the database column for direction;
length AD_STR_NM $50.; *AD_STR_NM is the database name for street name;
CALL SCAN(address, 2, _position, _length );
AD_HSE_NO=input(substr(address,1,_position-2),12.);
AD_STR_NM=strip(translate(substr(address,_position)," ","."));
if substr(AD_STR_NM,length(AD_STR_NM)-1) in (" N"," S"," E"," W") then do;
AD_CRDNL_DIR=substr(AD_STR_NM,length(AD_STR_NM)-1);
AD_STR_NM=strip(substr(AD_STR_NM,1,length(AD_STR_NM)-2));
end;
else if substr(AD_STR_NM,1,2) in ("N ","S ","E ","W ") then do;
AD_CRDNL_DIR=substr(AD_STR_NM,1,1);
AD_STR_NM=strip(substr(AD_STR_NM,3));
end;
_n=max(find(address,'Blvd'),find(address,'Ave'),find(address,'St'),
find(address,'Road'),find(address,'Dr'),find(address,'Hwy'),
find(address,'Rd'));
if _n gt 0 then AD_STR_NM=strip(substr(AD_STR_NM,1,_n-1));
run;
Yes, I might have to revert back to good old SQL for this one. My initial reason for not wanting to hard code all the possible values of street abbreviation is that I don't want to have a line of code that is a thousand characters wide. Is there a way to continue a statment on the next line? For example, consider the below...
if x in ('a','b','c','d','e','f','g','h') then do;
In VB I can do this...
if x in ('a','b','c','d' & _
'e','f','g','h') then do;
where the '& _' tells the compiler to continue to the next line. Can we do this in SAS? If so, I would not mind typing all the abbreviations out.
Thank you
Greg
Yes.
"if x in ('a','b','c','d','e','f','g','h') then do;"
is the same as
"if x in ('a'
,'b'
,'c'
,'d','e','f','g','h')
then do;"
DOH!!! That seemed to easy to try. I got so caught up on having a ';' at the end of every line, I did not even think of that. Thank you.
Greg
Arthur, back to the scan function. Boy, did that eliminate my need for all the substrings and parsing I did manually. It seems scan is programmed to do exactly what I was doing, for the most part. One thing I cannot wrap my head around is why sometimes it is necessary to use 'call'. In this case, I can simply do...
do i = 1 to 10 by 1;
x = scan(string,i);
end;
What is the 'call' doing? Thank you.
Greg
As you already know functions and CALL routines are similar. In general functions return a value and CALL routines alter the values of variables.
With regards to scan, the function returns the string that is being scanned for. The call routine returns information about the location of that string. Depending on what you are doing the information may be more useful than the string.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.