I'm trying to replace state abbreviations with state names in a free-text variable that often lists multiple states. TRANWRD works pretty well using arrays (see below) but in the example shown Alabama becomes AlabaMassachusetts. I was trying to use a regular expression to isolate abbreviations but they don't seem to take SAS variable values in the replacements. I also tried several ways to assign each abbreviation and state name to macro variables to use in the regular expression without success. I'd like to know:
1. What alternative method can I use to accomplish this task?
2. Even if another method is easier, I'd like to know how to update macro variables in real-time or some other method that can be used to enter text real-time into a regular expression.
I tried communities and google but couldn't find what I was looking for. Any input is appreciated.
Thanks.
data StateNames;
length states $300;
input states & $;
datalines;
CO(1), CT(1), FL(2), Alabama(1), MA(5), NJ(4), NY(2), PA(2), TX(1)
;
run;
data StateNames2;
set StateNames;
keep States_original States;
States_original=States;
array stateabr[59] $2 abr1-abr59 ('AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS'
'MO' 'MT' 'NE' 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY' 'AS' 'DC' 'FM'
'GU' 'MH' 'MP' 'PW' 'PR' 'VI');
array statenms[59] $50 nms1-nms59 ('Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii'
'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi'
'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma'
'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia'
'Wisconsin' 'Wyoming' 'American Samoa' 'Washington DC' 'Federated States of Micronesia' 'Guam' 'Republic of the Marshall Islands'
'Commonwealth of the Northern Mariana Islands' 'Republic of Palau' 'Puerto Rico' 'US Virgin Islands');
do i=1 to 59;
/* Method 1 - Alabama becomes AlabaMassachusetts */
/* States=TRANWRD(States,stateabr{i},strip(statenms{i}));*/
/* Method 2 - No change to SAS variable values dont work in regular expressions? */
abrev_re=prxparse("s/[^a-z]stateabr{i}[^a-z]/statenms{i}/i");
call prxchange(abrev_re,-1,States);
/* Method 3 - Use a real-time updating macro variable (How?) */
/* CALL SYMPUT('MACRO_stateabr', stateabr{i}) ;*/
/* CALL SYMPUT('MACRO_statenms', statenms{i}) ;*/
/* abrev_re=prxparse("s/(?<=[^a-z])&MACRO_stateabr.(?=[^a-z])/&MACRO_statenms./i");*/
/* call prxchange(abrev_re,-1,States);*/
end;
run;
proc print data=StateNames2 noobs; run;
PRX does the job. Note the use of precompiled patterns and the word boundary metacharacter (\b).
data StateNames;
length states $300;
input states & $;
datalines;
CO(1), CT(1), FL(2), Alabama(1), MA(5), NJ(4), NY(2), PA(2), TX(1)
;
data StateNames2;
set StateNames;
keep States_original States;
States_original=States;
array stateabr[59] $2 abr1-abr59 ('AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS'
'MO' 'MT' 'NE' 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY' 'AS' 'DC' 'FM'
'GU' 'MH' 'MP' 'PW' 'PR' 'VI');
array statenms[59] $50 nms1-nms59 ('Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii'
'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi'
'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma'
'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia'
'Wisconsin' 'Wyoming' 'American Samoa' 'Washington DC' 'Federated States of Micronesia' 'Guam' 'Republic of the Marshall Islands'
'Commonwealth of the Northern Mariana Islands' 'Republic of Palau' 'Puerto Rico' 'US Virgin Islands');
array prxId{59} _temporary_;
if not prxId{1} then
do i=1 to dim(prxId);
prxId{i} = prxparse(cats("s/\b",stateAbr{i},"\b/",stateNms{i},"/i"));
end;
do i = 1 to dim(prxId);
call prxchange(prxId{i},-1,States);
end;
run;
proc print data=StateNames2 noobs; run;
PG
PRX does the job. Note the use of precompiled patterns and the word boundary metacharacter (\b).
data StateNames;
length states $300;
input states & $;
datalines;
CO(1), CT(1), FL(2), Alabama(1), MA(5), NJ(4), NY(2), PA(2), TX(1)
;
data StateNames2;
set StateNames;
keep States_original States;
States_original=States;
array stateabr[59] $2 abr1-abr59 ('AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS'
'MO' 'MT' 'NE' 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY' 'AS' 'DC' 'FM'
'GU' 'MH' 'MP' 'PW' 'PR' 'VI');
array statenms[59] $50 nms1-nms59 ('Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii'
'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi'
'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma'
'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia'
'Wisconsin' 'Wyoming' 'American Samoa' 'Washington DC' 'Federated States of Micronesia' 'Guam' 'Republic of the Marshall Islands'
'Commonwealth of the Northern Mariana Islands' 'Republic of Palau' 'Puerto Rico' 'US Virgin Islands');
array prxId{59} _temporary_;
if not prxId{1} then
do i=1 to dim(prxId);
prxId{i} = prxparse(cats("s/\b",stateAbr{i},"\b/",stateNms{i},"/i"));
end;
do i = 1 to dim(prxId);
call prxchange(prxId{i},-1,States);
end;
run;
proc print data=StateNames2 noobs; run;
PG
Very clever! Thanks for your help PGStats.
And an alternate approach that will only work for US.
data _null_;
length searchstr $ 300;
searchstr = 'The string has AL and MA state abbreviations in it MP';
array stateabr[59] $2 _temporary_ ('AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS'
'MO' 'MT' 'NE' 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY' 'AS' 'DC' 'FM'
'GU' 'MH' 'MP' 'PW' 'PR' 'VI');
do i = 1 to dim(stateabr);
pos = findw(searchstr,stateabr);
if pos>0 then do;
searchstr=catx(' ',substr(searchstr,1,(pos-1)),strip(stnamel(stateabr)),substr(searchstr,(pos+3)));
put searchstr; /* this is to see each change as it happens for example*/
end;
end;
run;
One note. You should consider carefully your choice of replacing state codes written in lowercase. Many state codes are also two-letter English words. It might be better to restrict the replacement of common words (IN, ME, OK, OR, AS, etc) to uppercase.
PG
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.