DATA Step, Macro, Functions and more

Distinguishing an actual City value rather than a false positive

Accepted Solution Solved
Reply
Contributor
Posts: 26
Accepted Solution

Distinguishing an actual City value rather than a false positive

[ Edited ]

Afternoon All,

 

I have a situation where i need to remove false positives but keep the correct answers only, in this similar example - the City Hamburg.

 

DATA address_data;

input all_address_one_variable $char38.;

datalines;

23 HAMBURGER ST                                           (a)

HAMBURG Germany                                           (b)

100 Hamburg Street, Netherlands                        (c)

666 schaufhassenstrasse, Hamburg 68801         (d)

12 hamburg st, England                                       (e)

147 Hamburg Road Transylvania                         (f)

180 Hamburg-Germany;                                       (g)

run;

 

The optimal output would be to show only b,d and g ONLY.

 

So the Question is this - how can i select those only where it is the actual name Hamburg which actually relates to a city, not a street or road named Hamburg?

 

I'm thinking a PRXMATCH or similar is required, but there may be other possibilities

 

Many thanks

 

MR_E

 


Accepted Solutions
Solution
‎01-11-2017 08:06 AM
Regular Contributor
Posts: 233

Re: Distinguishing an actual City value rather than a false positive

There seems to be a large set of possible configurations and you have to figure it out first

before designing the appropriate filter.

 

The following program handles your example by using a regexp that matches all non wanted cases :

 

data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;

View solution in original post


All Replies
Super User
Posts: 5,424

Re: Distinguishing an actual City value rather than a false positive

Prxmatch alone can't fix it, you still need some logic to distinguish from the other cases you have. By doing so, you would probably mimic the functionality of SAS Data Management Studio with its QKB's. So would look in that direction, especially if you have more similar tasks to solve.
Data never sleeps
Contributor
Posts: 26

Re: Distinguishing an actual City value rather than a false positive

Hi LinusH - thank you for the prompt response, however this is part of a large piece of logic already and will need a script to remove these instances and leave only the correct results- are there any other approaches / functions to hand that may assist with the prxmatch (or even replace)

Solution
‎01-11-2017 08:06 AM
Regular Contributor
Posts: 233

Re: Distinguishing an actual City value rather than a false positive

There seems to be a large set of possible configurations and you have to figure it out first

before designing the appropriate filter.

 

The following program handles your example by using a regexp that matches all non wanted cases :

 

data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;
Contributor
Posts: 24

Re: Distinguishing an actual City value rather than a false positive

[ Edited ]

Hi.

 

As it was already said, you'll have to identify cleary what are the rules.

 

Looking at your data example, the following would be appropriate:

 

data OUT;
        set address_data;
        if prxmatch('/[^a-z]*hamburg[^a-z]+(?!st|ro)+/i',all_address_one_variable);
run;

 

If you need more info about regular expression check here: http://www.regular-expressions.info/

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

Trusted Advisor
Posts: 1,554

Re: Distinguishing an actual City value rather than a false positive

Does your exampe of picking the city HAMMBURG in GERMANY means that

correct answer is picking a specific given city in a specific given country ?

 

Can you trust addresses being in order format: home nomber - street name - city name - country name - zipcode 

even if some parts are absent ?

 

In your example:

- filter address if ST or STREET follow city_string (ignoring case)

- accept address if country_string follows city_string

- accept if numeric zip code follows city_string

 

Try next code, check results and add rules - either to filter or to accept - more rows:

 

%let city_string = HAMBURG;

%let country_string = GERMANY;

 

data test;

 set have;

      adr = upcase(address);

      if index(adr,'ST') > index(adr,"&city_string") then delete;

      if index(adr,"&country_string") > index(adr,"&city_string") or 

         scan(adr,-1,' 0123456789') = "&city_string"

      then output;

run;

Super User
Posts: 19,771

Re: Distinguishing an actual City value rather than a false positive

I would look at cleaning up/standardizing the addresses first. If you can parse out the components into the valid values then you can move on from there more easily. 

Super User
Posts: 11,343

Re: Distinguishing an actual City value rather than a false positive

[ Edited ]

If this data has been collected since about 1980 then beatings with wet noodles are in order for whoever designed the data collection or data entry system. This problem has been around for a very long time and the basic fix is to collect data into separate fields at the start.

 

Rant over.

We see this question so often that there are entirely too many people that are not doing any planning before they collect data thatt it is scary.

Super User
Posts: 19,771

Re: Distinguishing an actual City value rather than a false positive

A lot of time nowadays this could be for administrative data, that's collected for another purpose and then gets used for analysis when it was never intended to, or data that gets scraped off from different sources, such as text files.

 

Food for thought - the post office has been doing text recongition and address clean up for years, most of these algorithms recreate theirs. Too bad it's not public information since public dollars does the development. 

Contributor
Posts: 26

Re: Distinguishing an actual City value rather than a false positive

Ballardw - beatings with wet noodles sounds strange , yet painful - i agree, the department responsible for cleansing the data beforehand have not provided us with the optimal data lake, but i can sing and dance about that all day long.... trust me i've had a rant too!

 

When you say "collect data into seperate fields at the start", if we are talking about identifying cities from a variable which contains a full address( or partial ) in one variable, and we want to determine if it is a city or a street, given all the scenarios possible, would seperating this help? i.e tokenisation?

 

 

Respected Advisor
Posts: 4,173

Re: Distinguishing an actual City value rather than a false positive

Yes, tokenization is what you're after and there are data quality tools out there which could do a lot of the job for you. 

Here what SAS has on offer: https://support.sas.com/documentation/onlinedoc/dfdmstudio/

 

 

Super User
Posts: 11,343

Re: Distinguishing an actual City value rather than a false positive

If the data are entered/collected in a fixed order and there is something to delimit the field such as a comma similar to:

street, city, state/province then at worst you can parse by order of appearance. If the data isn't kept to some order then you start getting into something like identify the "clean" formatted data and then looking for patterns in the remaining data to process.

 

If there is no order then things could get very interesting. But if the general order above is maintained but may have a "missing" item then some searches could be done such as "is the last item a statename" (start by looking in the variables for something with the lowest number of levels). If so then search to see if the next to last is a 1) City name and 2) in the state, remainder would be street bits. If the last is not a state then search for cities. If it isn't a city then fun insues. 

 

If you have a postal code similar to the United States Zip code then the city and state/province could well be helpful in finding the bit that is a street address because our Zip will identify a city and state. Find those in the unstructed data and what is left is street.

Contributor
Posts: 26

Re: Distinguishing an actual City value rather than a false positive

Thank you all for the responses and suggestions, i will try some of the solutions here today and other steps before hand to see what else we can do with this.

 

It's great to have the support and knowledge of people like yourselves out there, and greatly appreciated

 

I'll update with the solution chosen later.

Trusted Advisor
Posts: 1,554

Re: Distinguishing an actual City value rather than a false positive

On second thought you better also check rejected rows:

 

%let city_string = HAMBURG;
%let country_string = GERMANY;
 
data accepted rejected;
 set have;
      adr = upcase(address);
      if index(adr,'ST') > index(adr,"&city_string") 
         then do; output rejected; return;  end;

      if index(adr,"&country_string") > index(adr,"&city_string") or 
         scan(adr,-1,' 0123456789') = "&city_string"
      then output accepted;
run;

you may find need move rows from rejected to accepted and vice versa.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 466 views
  • 7 likes
  • 8 in conversation