Setting Boundaries for Faster Geocoding? (Box Algorithm?)

Contributor
Posts: 33

After talking with a colleague about how slow geocoding is, he suggested a boundary box and he had some thoughts around it but I'm not sure how I would code this. If anyone has experience with this and would like to share I would appreciate the advice!

 

His thoughts were that a box algorithm will create a square that encompasses the circle that you are searching within and anything inside that square is opted in for search, but anything outside is passed over. This should speed up the geocoding process, I'm just not sure how to do it! 

 

This is my code for finding the coordinates per person (proc geocode) and then calculating the distance from a single point of interest using geodist. The proc geocode step currently takes about 15 minutes for approximately 45k records. The point of interest passed to geodist uses a fictitious lat/lon for this example.

/*use proc geocode to set lat and lon per person*/
options msglevel=N;
proc geocode 
	method=street /* Geocoding method */
	addressvar=add_line_1 /*should be address field, could also use ADDRESSSTATEVAR, ADDRESSZIPVAR, ADDRESSCITYVAR, etc to define other fields if not named correctly on source dataset*/
	data=work.groomed_population /* Input address data */
	lookupstreet=geo.usm /*needs to point at USM dataset downloaded from SAS*/
	out=work.geocoded; /* Output data set */
run;
quit;
options msglevel=I;

/*use geodist to find the mileage between the point of interest and each person's lat/lon*/
data distance;
retain x y;
set geocoded;
/*proc geocode writes latitude to y and longitude to x; 'M' returns miles*/
dist = geodist( 12.123456, -12.123456, y, x, 'M' );
run;
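
One way to read the colleague's bounding-box idea is as a cheap prefilter on coordinates that already exist. That means it can trim the geodist step, but not PROC GEOCODE itself, which has no coordinates to compare until it has run. A minimal sketch, assuming a 25-mile search radius and roughly 69 miles per degree of latitude (radius_mi, lat0, lon0, and the output data set name are illustrative and not from the original post):

```sas
/* Sketch only: bounding-box prefilter applied after geocoding.       */
/* lat0, lon0 and radius_mi are illustrative values.                  */
data work.distance_boxed;
  set work.geocoded;
  retain lat0 12.123456 lon0 -12.123456 radius_mi 25;
  /* skip rows that were not geocoded */
  if nmiss(x, y) = 0;
  /* about 69 miles per degree of latitude (approximation) */
  dlat = radius_mi / 69;
  /* a degree of longitude shrinks by cos(latitude) */
  dlon = radius_mi / (69 * cos(lat0 * constant('pi') / 180));
  /* cheap comparison: keep only points inside the square */
  if abs(y - lat0) <= dlat and abs(x - lon0) <= dlon;
  /* the square's corners lie outside the circle, so apply the exact test */
  dist = geodist(lat0, lon0, y, x, 'M');
  if dist <= radius_mi;
  drop lat0 lon0 radius_mi dlat dlon;
run;
```

The square test only uses subtraction and comparison, so geodist runs just for rows that could plausibly fall within the radius.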

All Replies
Super User
Posts: 19,080

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)

So you're looking to limit the USM dataset? 

 

If it has STATE can you limit it by that for your search area? 

Contributor
Posts: 33

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)

Correct, I may subset it by state and see if I can improve times that way. Thanks!
Super User
Posts: 11,118

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)

You would have to subset the USM data set but if your addresses are all over the country I'm not sure there would be an effective way to do that. If all of your data were restricted to a number of states you could subset the USM data to reflect that but I'm not sure how much that would help.

 

Have you looked at these suggestions from SAS for improving performance:

 

  • Index your lookup data sets by using the appropriate variables.

  • Load the lookup data sets into memory by using the SASFILE statement.
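
The SASFILE suggestion could look roughly like the sketch below. It assumes that USM, USS, and USP all sit in the GEO library used in the original post and that the machine has enough memory to hold them; if it does not, loading them may not help.

```sas
/* Sketch only: hold the street lookup data sets in memory for the run. */
/* Assumes USM, USS and USP all live in the GEO library and fit in RAM.  */
sasfile geo.usm load;
sasfile geo.uss load;
sasfile geo.usp load;

proc geocode
	method=street
	addressvar=add_line_1
	data=work.groomed_population
	lookupstreet=geo.usm
	out=work.geocoded;
run;

/* release the memory once geocoding is done */
sasfile geo.usm close;
sasfile geo.uss close;
sasfile geo.usp close;
```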

Contributor
Posts: 33

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)

Thank you for the ideas! I will play around and see what I can come up with.
SAS Employee
Posts: 170

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)

 

A couple more suggestions:

Is your USM file on the same machine as the one you are running SAS on, or is it remote? Remote access can slow it down a lot. It is much worse if the data set was downloaded on a machine with a different operating system.

 

Does PROC GEOCODE match most of the addresses at the street level? If it fails at street geocoding, it falls back to ZIP code and then to city lookup, which is slower because it is doing much more work. This can happen if the data is "dirty".
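
A quick way to see how many addresses fell back to ZIP or city matching is to tabulate the _MATCHED_ variable that PROC GEOCODE writes to its output data set. A minimal sketch, assuming the output data set name from the original post:

```sas
/* Sketch only: count geocoding results by match level. */
proc freq data=work.geocoded;
  tables _matched_ / missing;
run;
```

A large share of non-street matches would point at address cleanup rather than the lookup data as the place to spend effort.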

 

Have any of the indexes on USM, USS, or USP been deleted? That will make it much slower.
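
To check whether the indexes are still in place, PROC CONTENTS lists any indexes defined on a data set. A minimal sketch, assuming the three lookup data sets are in the GEO library:

```sas
/* Sketch only: the output includes a list of indexes when any exist. */
proc contents data=geo.usm;
run;

proc contents data=geo.uss;
run;

proc contents data=geo.usp;
run;
```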

 

Of course, multiple people running on the machine, multiple accesses to the USM file, and other factors can affect performance.

 

Here, we got about 11,000 obs per minute on a test run. At that rate, your 45k records would take about 4 minutes.

Contributor
Posts: 33

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)

Hi Darrell,

Thank you for the suggestions. I'm matching 99% on street level data and the remainder are matched on ZIP; none make it to city. Indexes are okay, but the usm, uss, and usp files are stored in a network location, as our desktop machines only have small SSDs and we don't have space to store them locally. I'm guessing the big slowdown is the network access to the datasets.

 

I looked at only keeping data for my state and dropping the rest, but the way the datasets link together I saw some marginal room for error that I didn't feel was worth the risk or time to develop, as we only geocode occasionally.

 

I appreciate the help.

Solution
‎01-04-2017 10:30 AM
Contributor
Posts: 33

Re: Setting Boundaries for Faster Geocoding? (Box Algorithm?)


Just wanted to follow up with what I ended up doing. I had some downtime, so instead of modifying the full US datasets I downloaded from SAS, I decided to create my own that only hold the data I need for my state.

 

So I did some research on the TIGER census files and downloaded the TIGER2GEOCODE SAS code from SAS Maps Online. After figuring out everything I needed to download from the Census FTP site for street level data for my state (I couldn't find this documented anywhere, so it took longer than I would have liked), I was able to use TIGER2GEOCODE to create new datasets for proc geocode without all of the other states' data.

 

I reduced my processing time by just over 10 minutes and received the same results.

 

 

Anyone looking to use TIGER2GEOCODE to create their own base datasets for street level geocoding should look into the following:

 

Census FTP Site for 2016 Data

 

Find your state's FIPS code. Filenames on the Census FTP site use this number. The full FIPS number in a county-level filename is 5 digits: the state FIPS code (2 digits) followed by the county code (3 digits).

 

Example: "tl_2016_01001_edges" represents a 2016 TIGER line edge file for state 01, county 001.
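
The naming rule above can be sketched in a short data step. The state and county values are illustrative; STFIPS converts a postal code to the state FIPS number, and the Z2. and Z3. formats add the leading zeros. With these inputs it should reproduce the name in the example above.

```sas
/* Sketch only: assemble a TIGER edges file name from FIPS codes. */
data _null_;
  state  = stfips('AL');   /* postal code to state FIPS number */
  county = 1;              /* county FIPS number, illustrative */
  /* zero-pad to 2 + 3 digits and join into the 5-digit code */
  fips5  = cats(put(state, z2.), put(county, z3.));
  fname  = cats('tl_2016_', fips5, '_edges');
  /* write both values to the log */
  put fips5= fname=;
run;
```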

 

Files needed for TIGER2GEOCODE (disclaimer: there may be some extra files here, but I think this is everything. I got to the point where I would run the code, read the error about which files were missing, and then go download them from the census site):

  • PLACE file for your state
  • ADDR files for your state and counties
  • FACE files for your state and counties
  • EDGE files for your state and counties
  • FEATNAME files for your state and counties

 

Once you have these files downloaded and unzipped, you're ready to use TIGER2GEOCODE to create your datasets for use with street level proc geocode.
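
Before running TIGER2GEOCODE, it may save a few error-and-download cycles to check the download folder against the list above. The sketch below is not part of TIGER2GEOCODE. It assumes every unzipped archive contains a .dbf file, that the state-level PLACE file is named tl_2016_SS_place, and that the folder, state code, and county list (all hypothetical here) are replaced with your own.

```sas
%let tiger_dir = /data/tiger/2016;  /* hypothetical folder holding the unzipped files */
%let st = 01;                       /* two-digit state FIPS code, illustrative */

data work.missing_files;
  length suffix $9 path $260;
  /* state-level PLACE file (assumed naming) */
  path = cats("&tiger_dir/tl_2016_&st._place.dbf");
  if not fileexist(path) then output;
  /* county-level files; the county list here is illustrative */
  do county = 1, 3, 5;
    do suffix = 'addr', 'edges', 'faces', 'featnames';
      path = cats("&tiger_dir/tl_2016_&st", put(county, z3.), '_', suffix, '.dbf');
      if not fileexist(path) then output;
    end;
  end;
  keep path;
run;

/* lists the expected files that were not found */
proc print data=work.missing_files;
run;
```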

 

 

This is what worked for me; hopefully it works for others as well if someone finds this in the future. Thank you to everyone for the assistance provided earlier in the thread.
