Evening All,
I have an issue in a large piece of code where the array is not loading all values, it only creates the last possible one, meaning the rest of the array is not creating all possible values for me to match on with INDEX later on.
so here is an example:
DATA cities1;
input city_name $char50.;
datalines;
HAMBURG
Aleppo
GENEVA
paris
birmingham
Zurich
Coventry
SINGAPORE
PYONGYANG
;
run;
DATA suflist1;
input suffix $char50.;
datalines;
street
road
avenue
port
;
run;
data test;
array CITY_ARRAY (&num_recs) $200 _temporary_;
array SUFLIST (&num_recs)$200 _temporary_;
* load cities names into list;
if _n_=1 then
do p=1 to nobs;
SET SUFLIST1 point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
set CITIES1 point=p nobs=nobs;
/* Build CITY_ARRAY for "City_name" */
CITY_ARRAY(p)=City_name;
end;
do i=1 to nobs;
name = upcase(catx(" ",(CITY_ARRAY[i]),(SUFLIST[i])));
end;
run;
So i want Name to show me the value obviously, so it gives me evidence that all values have been considered, so i would expect to see Hamburg Street , Hamburg Road all the way down to Pyongpang Port, however all i see is the latter, whats happened to the rest of the values? What have i crucially missed off?
As mentioned this is part of a bigger matching process with hundreds of values in the datastes im refering too, so a workaround with each individual variable in the query is not feasible
much appreciated
Hello,
You don't have to test for all possible combinations. If there is no city match then you won't have street match either.
You can test the city match first and if there is a match, check the street types against the matched city :
data test;
infile datalines truncover;
input address $100.;
datalines;
10 Hamburg street Rotterdam
5 Wilhelmstrasse Hamburg
;
run;
data want;
array CITY_ARRAY (&num_cities.) $200 _temporary_;
array SUFLIST (&num_suffixes.)$200 _temporary_;
* load cities names into list;
if _n_=1 then do;
do p=1 to nobs;
SET SUFLIST1 point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
end;
do p=1 to nobs2;
set CITIES1 point=p nobs=nobs2;
/* Build CITY_ARRAY for "City_name" */
CITY_ARRAY(p)=City_name;
end;
end;
set test;
CITY_MATCH="N";
STREET_MATCH="N";
do i=1 to nobs2;
City=upcase(CITY_ARRAY(i));
found=indexw(upcase(address),City);
if (found) then do;
CITY_MATCH="Y";
do j=1 to nobs;
StreetType=upcase(SUFLIST(j));
sourceString=upcase(address);
searchString=catx(" ",City,StreetType);
found2=indexw(sourceString,searchString);
put sourceString= city= streetType= searchString= found2=;
if found2 then do;
STREET_MATCH="Y";
leave;
end;
end;
leave;
end;
end;
run;
actual code where the issue is if it helps:
where ALL_Address is the the address fields being matched against on the LARGE99 Data set
FYI THE CITY MATCH works fine as it should, but the CITY Match is the pain, where i believe this array is not performing as it should...
data CITY_test_data (DROP=City_name Country ref) NO_MATCH (DROP=City_name Country ref) POSSIBLE_STREET (DROP=City_name Country ref);
array CITY_ARRAY (&num_recs) $200 _temporary_;
array SUFLIST (&num_recs)$200 _temporary_;
* load cities names into list;
if _n_=1 then
do p=1 to nobs;
SET SUFLIST point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
set CITIES point=p nobs=nobs;
/* Build CITY_ARRAY for "City_name" */
CITY_ARRAY(p)=City_name;
end;
* match against test data;
set LARGE99_data_PREP;
/* DETERMINE IF ITS A MATCH ON A STREET/ROAD/AVENUE NAME*/
STREET_MATCH = 'N';
found=0;
do i=1 to nobs until(found);
found=indexw(upcase(ALL_ADDRESS), upcase(catx(" ",(CITY_ARRAY[i]),(SUFLIST[i]))));
end;
if found THEN
DO;
STREET_MATCH='Y';
end;
end;
/*ADD A MARKER TO IDENTIFY IF THERE IS A MATCH TO THE LISTED CITITES*/
CITY_MATCH = 'N';
/*RUN THROUGH THE ARRAY*/
DO j=1 TO dim(CITY_array);
/*CHECK AGAINST ADDRESS FIELDS FOR MATCH - IS THE CITY PART OF THE ADDRESS FIELD. USE UPPER CASE ON BOTH VARIABLES */
IF indexw(upcase(ALL_ADDRESS), upcase(CITY_ARRAY[j])) THEN
DO;
CITY_MATCH='Y';
LEAVE;
END;
end;
IF CITY_MATCH = 'Y' AND STREET_MATCH = 'N' THEN OUTPUT CITY_test_data;
IF CITY_MATCH = 'N' AND STREET_MATCH = 'N' THEN OUTPUT NO_MATCH;
IF CITY_MATCH = 'Y' AND STREET_MATCH = 'Y' THEN OUTPUT POSSIBLE_STREET;
run;
Where are you going to get a value for &Num_RECS? Code will fail without a value.
SET SUFLIST1 point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
set CITIES1 point=p nobs=nobs;
Uses the same variables for POINT AND NOBS. REAL bad idea to reuse variable names from different sets as are you sure which one is being used where?
You really need to display what you think the results for this exercise are supposed to be as I suspect you are making it much more difficult than needed.
If i understand your problem correctly,
- you have different numbers of records in your Cities1 and Suflist1 datasets so you have to use two macrovariables
and two variables for the nobs parameter,
- you have to use nested loops to generate all possible combinations
DATA cities1;
input city_name $char50.;
call symput("num_cities",_N_);
datalines;
HAMBURG
Aleppo
GENEVA
paris
birmingham
Zurich
Coventry
SINGAPORE
PYONGYANG
;
run;
DATA suflist1;
input suffix $char50.;
call symput("num_suffixes",_N_);
datalines;
street
road
avenue
port
;
run;
data test;
array CITY_ARRAY (&num_cities.) $200 _temporary_;
array SUFLIST (&num_suffixes.)$200 _temporary_;
* load cities names into list;
if _n_=1 then do;
do p=1 to nobs;
SET SUFLIST1 point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
end;
do p=1 to nobs2;
set CITIES1 point=p nobs=nobs2;
/* Build CITY_ARRAY for "City_name" */
CITY_ARRAY(p)=City_name;
end;
do i=1 to nobs;
do j=1 to nobs2;
name = upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i])));
output;
end;
end;
end;
run;
@gamotte you did understand correctly and it worked as i wanted on the dummy data thankyou, great stuff.
However when applying this logic in my full code the data step is taking an age (still running as i write this), could this be because of where i have placed the end statements? or is it down to the fact that &numrecs is 4067 obs and &suf_recs is 206 obs :
The end outputs will give me those where it is just a city in the address (i.e Paris (not followed by a suffix), Those which are a street (Paris Street) and those which match neither.
This previously worked fine on the city match, and i could prxmatch a few suffixes, but 200 requires something abit more strategic...
This is the still running code with your logic applied,
data CITY_test_valery_data (DROP=City_name Country ref) NO_MATCH (DROP=City_name Country ref) POSSIBLE_STREET (DROP=City_name Country ref);
/* array regex (&num_recs) $200 _temporary_;*/
array CITY_ARRAY (&num_recs) $200 _temporary_;
array SUFLIST (&suf_recs)$200 _temporary_;
* load cities names into list;
if _n_=1 then
do p=1 to nobs;
SET SUFLIST point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
end;
do p=1 to nobs2;
set CITIES point=p nobs=nobs2;
/* Build CITY_ARRAY for "City_name" */
CITY_ARRAY(p)=City_name;
end;
* match against test data;
set LARGE99_data_PREP;
/* DETERMINE IF ITS A MATCH ON A STREET/ROAD/AVENUE NAME*/
STREET_MATCH = 'N';
found=0;
do i=1 to nobs until (found);
do j=1 to nobs2 until(found);
what_it_reads = upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i])));
found=indexw(upcase(ALL_ADDRESS), upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i]))));
end; /* are these end statements in the right place */
end; /* are these end statements in the right place */
if found THEN
DO;
STREET_MATCH='Y';
end;
/*ADD A MARKER TO IDENTIFY IF THERE IS A MATCH TO THE LISTED CITITES*/
CITY_MATCH = 'N';
/*RUN THROUGH THE ARRAY*/
DO k=1 TO dim(CITY_array);
/*CHECK AGAINST ADDRESS FIELDS FOR MATCH - IS THE CITY PART OF THE ADDRESS FIELD. USE UPPER CASE ON BOTH VARIABLES */
IF indexw(upcase(ALL_ADDRESS), upcase(CITY_ARRAY[k])) THEN
DO;
CITY_MATCH='Y';
LEAVE;
END;
end;
IF CITY_MATCH = 'Y' AND STREET_MATCH = 'N' THEN OUTPUT CITY_test_valery_data;
IF CITY_MATCH = 'N' AND STREET_MATCH = 'N' THEN OUTPUT NO_MATCH;
IF CITY_MATCH = 'Y' AND STREET_MATCH = 'Y' THEN OUTPUT POSSIBLE_STREET;
run;
1) Post code into a code box opened with the {i} icon to preserve formatting
2)This code has the potential for never ending
do i=1 to nobs until (found); do j=1 to nobs2 until(found); what_it_reads = upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i]))); found=indexw(upcase(ALL_ADDRESS), upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i])))); end; /* are these end statements in the right place */ end; /* are these end statements in the right place */
If the boundary of the city_array and suflist value does not fall on a WORD boundary(a character other than space at begin/end/middle ) in All_address the result of found may be 0 for all records. You may want a simple INDEX. Of if All_address has two or more consecutive spaces in the body then your match may fail. See this example:
data example; add = " This St,NYC NYC"; str = "NYC NYC"; resultw = indexw(add,str); put resultw=; run;
Or this with 2 spaces
data example; add = " This St NYC NYC"; str = "NYC NYC"; resultw = indexw(add,str); put resultw=; run;
Hello,
You don't have to test for all possible combinations. If there is no city match then you won't have street match either.
You can test the city match first and if there is a match, check the street types against the matched city :
data test;
infile datalines truncover;
input address $100.;
datalines;
10 Hamburg street Rotterdam
5 Wilhelmstrasse Hamburg
;
run;
data want;
array CITY_ARRAY (&num_cities.) $200 _temporary_;
array SUFLIST (&num_suffixes.)$200 _temporary_;
* load cities names into list;
if _n_=1 then do;
do p=1 to nobs;
SET SUFLIST1 point=p nobs=nobs;
SUFLIST(p)=SUFFIX;
end;
do p=1 to nobs2;
set CITIES1 point=p nobs=nobs2;
/* Build CITY_ARRAY for "City_name" */
CITY_ARRAY(p)=City_name;
end;
end;
set test;
CITY_MATCH="N";
STREET_MATCH="N";
do i=1 to nobs2;
City=upcase(CITY_ARRAY(i));
found=indexw(upcase(address),City);
if (found) then do;
CITY_MATCH="Y";
do j=1 to nobs;
StreetType=upcase(SUFLIST(j));
sourceString=upcase(address);
searchString=catx(" ",City,StreetType);
found2=indexw(sourceString,searchString);
put sourceString= city= streetType= searchString= found2=;
if found2 then do;
STREET_MATCH="Y";
leave;
end;
end;
leave;
end;
end;
run;
AWESOME - thank you so much, this is much faster and works as intended - exactly what i was looking for
Your help is greatly appreciated, thank you
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.