DATA Step, Macro, Functions and more

Loading Arrays with Multiple Dataset Values

Accepted Solution Solved
Reply
Contributor
Posts: 26
Accepted Solution

Loading Arrays with Multiple Dataset Values

Evening All,

 

I have an issue in a large piece of code where the array is not loading all values, it only creates the last possible one, meaning the rest of the array is not creating all possible values for me to match on with INDEX later on.

 

so here is an example:

 

DATA cities1;

input city_name $char50.;

datalines;

HAMBURG

Aleppo

GENEVA

paris

birmingham

Zurich

Coventry

SINGAPORE

PYONGYANG

;

run;

DATA suflist1;

input suffix $char50.;

datalines;

street

road

avenue

port

;

run;

 

data test;

array CITY_ARRAY (&num_recs) $200 _temporary_;

array SUFLIST (&num_recs)$200 _temporary_;

 

* load cities names into list;

if _n_=1 then

do p=1 to nobs;

SET SUFLIST1 point=p nobs=nobs;

SUFLIST(p)=SUFFIX;

set CITIES1 point=p nobs=nobs;

/* Build CITY_ARRAY for "City_name" */

CITY_ARRAY(p)=City_name;

end;

do i=1 to nobs;

name = upcase(catx(" ",(CITY_ARRAY[i]),(SUFLIST[i])));

end;

run;

 

So i want Name to show me the value obviously, so it gives me evidence that all values have been considered, so i would expect to see Hamburg Street , Hamburg Road all the way down to Pyongpang Port, however all i see is the latter, whats happened to the rest of the values? What have i crucially missed off?

 

As mentioned this is part of a bigger matching process with hundreds of values in the datastes im refering too, so a workaround with each individual variable in the query is not feasible

 

much appreciated

 

 

 


Accepted Solutions
Solution
‎01-18-2017 05:16 AM
Regular Contributor
Posts: 194

Re: Loading Arrays with Multiple Dataset Values

Hello,

 

You don't have to test for all possible combinations. If there is no city match then you won't have street match either.

You can test the city match first and if there is a match, check the street types against the matched city :

 

data test;
infile datalines truncover;
input address $100.;
datalines;
10 Hamburg street Rotterdam
5 Wilhelmstrasse Hamburg
;
run;

data want;
    array CITY_ARRAY (&num_cities.) $200 _temporary_;
    array SUFLIST (&num_suffixes.)$200 _temporary_;

    * load cities names into list;
    if _n_=1 then do;
        do p=1 to nobs;
            SET SUFLIST1 point=p nobs=nobs;
            SUFLIST(p)=SUFFIX;
        end;
        do p=1 to nobs2;
            set CITIES1 point=p nobs=nobs2;
            /* Build CITY_ARRAY for "City_name" */
            CITY_ARRAY(p)=City_name;
        end;
    end;

    set test;

    CITY_MATCH="N";
    STREET_MATCH="N";

    do i=1 to nobs2;
        City=upcase(CITY_ARRAY(i));

        found=indexw(upcase(address),City);

        if (found) then do;
            CITY_MATCH="Y";

            do j=1 to nobs;
                StreetType=upcase(SUFLIST(j));

				sourceString=upcase(address);
                searchString=catx(" ",City,StreetType);

				found2=indexw(sourceString,searchString);

                put sourceString= city= streetType= searchString= found2=;

                if found2 then do;
                    STREET_MATCH="Y";
                    leave;
                end;
            end;

            leave;
        end;
    end;
run;

View solution in original post


All Replies
Contributor
Posts: 26

Re: Loading Arrays with Multiple Dataset Values

actual code where the issue is if it helps:

where ALL_Address is the the address fields being matched against on the LARGE99 Data set

 

FYI THE CITY MATCH works fine as it should, but the CITY Match is the pain, where i believe this array is not performing as it should...

 

data CITY_test_data (DROP=City_name Country ref) NO_MATCH (DROP=City_name Country ref) POSSIBLE_STREET (DROP=City_name Country ref);

 

array CITY_ARRAY (&num_recs) $200 _temporary_;

array SUFLIST (&num_recs)$200 _temporary_;

 

* load cities names into list;

if _n_=1 then

do p=1 to nobs;

SET SUFLIST point=p nobs=nobs;

SUFLIST(p)=SUFFIX;

set CITIES point=p nobs=nobs;

/* Build CITY_ARRAY for "City_name" */

CITY_ARRAY(p)=City_name;

end;

* match against test data;

set LARGE99_data_PREP;

/* DETERMINE IF ITS A MATCH ON A STREET/ROAD/AVENUE NAME*/

STREET_MATCH = 'N';

found=0;

 

do i=1 to nobs until(found);

 

found=indexw(upcase(ALL_ADDRESS), upcase(catx(" ",(CITY_ARRAY[i]),(SUFLIST[i]))));

end;

if found THEN

DO;

STREET_MATCH='Y';

end;

end;

 

/*ADD A MARKER TO IDENTIFY IF THERE IS A MATCH TO THE LISTED CITITES*/

CITY_MATCH = 'N';

/*RUN THROUGH THE ARRAY*/

DO j=1 TO dim(CITY_array);

/*CHECK AGAINST ADDRESS FIELDS FOR MATCH - IS THE CITY PART OF THE ADDRESS FIELD. USE UPPER CASE ON BOTH VARIABLES */

 

IF indexw(upcase(ALL_ADDRESS), upcase(CITY_ARRAY[j])) THEN

DO;

CITY_MATCH='Y';

LEAVE;

END;

end;

 

IF CITY_MATCH = 'Y' AND STREET_MATCH = 'N' THEN OUTPUT CITY_test_data;

IF CITY_MATCH = 'N' AND STREET_MATCH = 'N' THEN OUTPUT NO_MATCH;

IF CITY_MATCH = 'Y' AND STREET_MATCH = 'Y' THEN OUTPUT POSSIBLE_STREET;

run;

Super User
Posts: 10,466

Re: Loading Arrays with Multiple Dataset Values

Where are you going to get a value for &Num_RECS? Code will fail without a value.

 

SET SUFLIST1 point=p nobs=nobs;

SUFLIST(p)=SUFFIX;

set CITIES1 point=p nobs=nobs;

 

Uses the same variables for POINT AND NOBS. REAL bad idea to reuse variable names from different sets as are you sure which one is being used where?

 

You really need to display what you think the results for this exercise are supposed to be as I suspect you are making it much more difficult than needed.

Regular Contributor
Posts: 194

Re: Loading Arrays with Multiple Dataset Values

If i understand your problem correctly,

- you have different numbers of records in your Cities1 and Suflist1 datasets so you have to use two macrovariables

and two variables for the nobs parameter,

- you have to use nested loops to generate all possible combinations

 

DATA cities1;
    input city_name $char50.;
	call symput("num_cities",_N_);
    datalines;
HAMBURG
Aleppo
GENEVA
paris
birmingham
Zurich
Coventry
SINGAPORE
PYONGYANG
;
run;
DATA suflist1;
    input suffix $char50.;
	call symput("num_suffixes",_N_);
    datalines;
street
road
avenue
port
;
run;
 
data test;
    array CITY_ARRAY (&num_cities.) $200 _temporary_;
    array SUFLIST (&num_suffixes.)$200 _temporary_;

    * load cities names into list;
    if _n_=1 then do;
        do p=1 to nobs;
            SET SUFLIST1 point=p nobs=nobs;
            SUFLIST(p)=SUFFIX;
		end;
		do p=1 to nobs2;
            set CITIES1 point=p nobs=nobs2;
            /* Build CITY_ARRAY for "City_name" */
            CITY_ARRAY(p)=City_name;
        end;
        do i=1 to nobs;
			do j=1 to nobs2;
	            name = upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i])));
				output;
			end;
        end;
	end;
run;
Contributor
Posts: 26

Re: Loading Arrays with Multiple Dataset Values

@gamotte you did understand correctly and it worked as i wanted on the dummy data thankyou, great stuff.

 

However when applying this logic in my full code the data step is taking an age (still running as i write this), could this be because of where i have placed the end statements? or is it down to the fact that &numrecs is 4067 obs and &suf_recs is 206 obs :

 

The end outputs will give me those where it is just a city in the address (i.e Paris (not followed by a suffix), Those which are a street (Paris Street) and those which match neither.

 

This previously worked fine on the city match, and i could prxmatch a few suffixes, but 200 requires something abit more strategic...

 

This is the still running code with your logic applied,

 

data CITY_test_valery_data (DROP=City_name Country ref) NO_MATCH (DROP=City_name Country ref) POSSIBLE_STREET (DROP=City_name Country ref);

/* array regex (&num_recs) $200 _temporary_;*/

array CITY_ARRAY (&num_recs) $200 _temporary_;

array SUFLIST (&suf_recs)$200 _temporary_;

 

* load cities names into list;

if _n_=1 then

do p=1 to nobs;

SET SUFLIST point=p nobs=nobs;

SUFLIST(p)=SUFFIX;

end;

do p=1 to nobs2;

set CITIES point=p nobs=nobs2;

/* Build CITY_ARRAY for "City_name" */

CITY_ARRAY(p)=City_name;

end;

* match against test data;

set LARGE99_data_PREP;

/* DETERMINE IF ITS A MATCH ON A STREET/ROAD/AVENUE NAME*/

STREET_MATCH = 'N';

found=0;

do i=1 to nobs until (found);

do j=1 to nobs2 until(found);

 

what_it_reads = upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i])));

found=indexw(upcase(ALL_ADDRESS), upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i]))));

end; /* are these end statements in the right place */

end; /* are these end statements in the right place */

if found THEN

DO;

STREET_MATCH='Y';

end;

/*ADD A MARKER TO IDENTIFY IF THERE IS A MATCH TO THE LISTED CITITES*/

CITY_MATCH = 'N';

/*RUN THROUGH THE ARRAY*/

DO k=1 TO dim(CITY_array);

/*CHECK AGAINST ADDRESS FIELDS FOR MATCH - IS THE CITY PART OF THE ADDRESS FIELD. USE UPPER CASE ON BOTH VARIABLES */

 

IF indexw(upcase(ALL_ADDRESS), upcase(CITY_ARRAY[k])) THEN

DO;

CITY_MATCH='Y';

LEAVE;

END;

end;

IF CITY_MATCH = 'Y' AND STREET_MATCH = 'N' THEN OUTPUT CITY_test_valery_data;

IF CITY_MATCH = 'N' AND STREET_MATCH = 'N' THEN OUTPUT NO_MATCH;

IF CITY_MATCH = 'Y' AND STREET_MATCH = 'Y' THEN OUTPUT POSSIBLE_STREET;

run;

Super User
Posts: 10,466

Re: Loading Arrays with Multiple Dataset Values

1) Post code into a code box opened with the {i} icon to preserve formatting

 

2)This code has the potential for never ending

   do i=1 to nobs until (found);
      do j=1 to nobs2 until(found);
          
         what_it_reads = upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i])));
         found=indexw(upcase(ALL_ADDRESS), upcase(catx(" ",(CITY_ARRAY[j]),(SUFLIST[i]))));
      end; /* are these end statements in the right place */
   end; /* are these end statements in the right place */

If the boundary of the city_array and suflist value does not fall on a WORD boundary(a character other than space at begin/end/middle ) in All_address the result of found may be 0 for all records. You may want a simple INDEX. Of if All_address has two or more consecutive spaces in the body then your match may fail. See this example:

 

data example;
   add = " This St,NYC NYC";
   str = "NYC NYC";
   resultw = indexw(add,str);
   put resultw=;
run;

Or this with 2 spaces

data example;
   add = " This St NYC  NYC";
   str = "NYC NYC";
   resultw = indexw(add,str);
   put resultw=;
run;
Solution
‎01-18-2017 05:16 AM
Regular Contributor
Posts: 194

Re: Loading Arrays with Multiple Dataset Values

Hello,

 

You don't have to test for all possible combinations. If there is no city match then you won't have street match either.

You can test the city match first and if there is a match, check the street types against the matched city :

 

data test;
infile datalines truncover;
input address $100.;
datalines;
10 Hamburg street Rotterdam
5 Wilhelmstrasse Hamburg
;
run;

data want;
    array CITY_ARRAY (&num_cities.) $200 _temporary_;
    array SUFLIST (&num_suffixes.)$200 _temporary_;

    * load cities names into list;
    if _n_=1 then do;
        do p=1 to nobs;
            SET SUFLIST1 point=p nobs=nobs;
            SUFLIST(p)=SUFFIX;
        end;
        do p=1 to nobs2;
            set CITIES1 point=p nobs=nobs2;
            /* Build CITY_ARRAY for "City_name" */
            CITY_ARRAY(p)=City_name;
        end;
    end;

    set test;

    CITY_MATCH="N";
    STREET_MATCH="N";

    do i=1 to nobs2;
        City=upcase(CITY_ARRAY(i));

        found=indexw(upcase(address),City);

        if (found) then do;
            CITY_MATCH="Y";

            do j=1 to nobs;
                StreetType=upcase(SUFLIST(j));

				sourceString=upcase(address);
                searchString=catx(" ",City,StreetType);

				found2=indexw(sourceString,searchString);

                put sourceString= city= streetType= searchString= found2=;

                if found2 then do;
                    STREET_MATCH="Y";
                    leave;
                end;
            end;

            leave;
        end;
    end;
run;
Contributor
Posts: 26

Re: Loading Arrays with Multiple Dataset Values

AWESOME - thank you so much, this is much faster and works as intended - exactly what i was looking for

 

Your help is greatly appreciated, thank you

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 144 views
  • 1 like
  • 3 in conversation