Solved: Distinguishing an actual City value rather than a false positive

MR_E · Posted 01-06-2017 08:59 AM

Afternoon All,

I have a situation where i need to remove false positives but keep the correct answers only, in this similar example - the City Hamburg.

DATA address_data;

input all_address_one_variable $char38.;

datalines;

23 HAMBURGER ST (a)

HAMBURG Germany (b)

100 Hamburg Street, Netherlands (c)

666 schaufhassenstrasse, Hamburg 68801 (d)

12 hamburg st, England (e)

147 Hamburg Road Transylvania (f)

180 Hamburg-Germany; (g)

run;

The optimal output would be to show only b,d and g ONLY.

So the Question is this - how can i select those only where it is the actual name Hamburg which actually relates to a city, not a street or road named Hamburg?

I'm thinking a PRXMATCH or similar is required, but there may be other possibilities

Many thanks

MR_E

gamotte · Posted 01-06-2017 09:44 AM

There seems to be a large set of possible configurations and you have to figure it out first

before designing the appropriate filter.

The following program handles your example by using a regexp that matches all non wanted cases :

data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;

View solution in original post

LinusH · Posted 01-06-2017 09:16 AM

Prxmatch alone can't fix it, you still need some logic to distinguish from the other cases you have. By doing so, you would probably mimic the functionality of SAS Data Management Studio with its QKB's. So would look in that direction, especially if you have more similar tasks to solve.

Data never sleeps

MR_E · Posted 01-06-2017 09:33 AM

Hi LinusH - thank you for the prompt response, however this is part of a large piece of logic already and will need a script to remove these instances and leave only the correct results- are there any other approaches / functions to hand that may assist with the prxmatch (or even replace)

gamotte · Posted 01-06-2017 09:44 AM

There seems to be a large set of possible configurations and you have to figure it out first

before designing the appropriate filter.

The following program handles your example by using a regexp that matches all non wanted cases :

data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;

Daniel-Santos · Posted 01-06-2017 10:12 AM

Hi.

As it was already said, you'll have to identify cleary what are the rules.

Looking at your data example, the following would be appropriate:

data OUT;
set address_data;
if prxmatch('/[^a-z]*hamburg[^a-z]+(?!st|ro)+/i',all_address_one_variable);
run;

If you need more info about regular expression check here: http://www.regular-expressions.info/

Hope it helps.

Daniel Santos @ www.cgd.pt

Shmuel · Posted 01-06-2017 10:23 AM

Does your exampe of picking the city HAMMBURG in GERMANY means that

correct answer is picking a specific given city in a specific given country ?

Can you trust addresses being in order format: home nomber - street name - city name - country name - zipcode

even if some parts are absent ?

In your example:

- filter address if ST or STREET follow city_string (ignoring case)

- accept address if country_string follows city_string

- accept if numeric zip code follows city_string

Try next code, check results and add rules - either to filter or to accept - more rows:

%let city_string = HAMBURG;

%let country_string = GERMANY;

data test;

set have;

adr = upcase(address);

if index(adr,'ST') > index(adr,"&city_string") then delete;

if index(adr,"&country_string") > index(adr,"&city_string") or

scan(adr,-1,' 0123456789') = "&city_string"

then output;

run;

Reeza · Posted 01-06-2017 10:55 AM

I would look at cleaning up/standardizing the addresses first. If you can parse out the components into the valid values then you can move on from there more easily.

ballardw · Posted 01-06-2017 11:53 AM

If this data has been collected since about 1980 then beatings with wet noodles are in order for whoever designed the data collection or data entry system. This problem has been around for a very long time and the basic fix is to collect data into separate fields at the start.

Rant over.

We see this question so often that there are entirely too many people that are not doing any planning before they collect data thatt it is scary.

Reeza · Posted 01-06-2017 12:03 PM

A lot of time nowadays this could be for administrative data, that's collected for another purpose and then gets used for analysis when it was never intended to, or data that gets scraped off from different sources, such as text files.

Food for thought - the post office has been doing text recongition and address clean up for years, most of these algorithms recreate theirs. Too bad it's not public information since public dollars does the development.

MR_E · Posted 01-09-2017 03:59 AM

Ballardw - beatings with wet noodles sounds strange , yet painful - i agree, the department responsible for cleansing the data beforehand have not provided us with the optimal data lake, but i can sing and dance about that all day long.... trust me i've had a rant too!

When you say "collect data into seperate fields at the start", if we are talking about identifying cities from a variable which contains a full address( or partial ) in one variable, and we want to determine if it is a city or a street, given all the scenarios possible, would seperating this help? i.e tokenisation?

Patrick · Posted 01-09-2017 06:46 AM

Yes, tokenization is what you're after and there are data quality tools out there which could do a lot of the job for you.

Here what SAS has on offer: https://support.sas.com/documentation/onlinedoc/dfdmstudio/

ballardw · Posted 01-09-2017 12:12 PM

If the data are entered/collected in a fixed order and there is something to delimit the field such as a comma similar to:

street, city, state/province then at worst you can parse by order of appearance. If the data isn't kept to some order then you start getting into something like identify the "clean" formatted data and then looking for patterns in the remaining data to process.

If there is no order then things could get very interesting. But if the general order above is maintained but may have a "missing" item then some searches could be done such as "is the last item a statename" (start by looking in the variables for something with the lowest number of levels). If so then search to see if the next to last is a 1) City name and 2) in the state, remainder would be street bits. If the last is not a state then search for cities. If it isn't a city then fun insues.

If you have a postal code similar to the United States Zip code then the city and state/province could well be helpful in finding the bit that is a street address because our Zip will identify a city and state. Find those in the unstructed data and what is left is street.

MR_E · Posted 01-09-2017 04:02 AM

Thank you all for the responses and suggestions, i will try some of the solutions here today and other steps before hand to see what else we can do with this.

It's great to have the support and knowledge of people like yourselves out there, and greatly appreciated

I'll update with the solution chosen later.

Shmuel · Posted 01-09-2017 04:30 AM

On second thought you better also check rejected rows:

%let city_string = HAMBURG;
%let country_string = GERMANY;
 
data accepted rejected;
 set have;
      adr = upcase(address);
      if index(adr,'ST') > index(adr,"&city_string") 
         then do; output rejected; return;  end;

      if index(adr,"&country_string") > index(adr,"&city_string") or 
         scan(adr,-1,' 0123456789') = "&city_string"
      then output accepted;
run;

you may find need move rows from rejected to accepted and vice versa.

Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Re: Distinguishing an actual City value rather than a false positive

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away