BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MR_E
Obsidian | Level 7

Afternoon All,

 

I have a situation where i need to remove false positives but keep the correct answers only, in this similar example - the City Hamburg.

 

DATA address_data;

input all_address_one_variable $char38.;

datalines;

23 HAMBURGER ST                                           (a)

HAMBURG Germany                                           (b)

100 Hamburg Street, Netherlands                        (c)

666 schaufhassenstrasse, Hamburg 68801         (d)

12 hamburg st, England                                       (e)

147 Hamburg Road Transylvania                         (f)

180 Hamburg-Germany;                                       (g)

run;

 

The optimal output would be to show only b,d and g ONLY.

 

So the Question is this - how can i select those only where it is the actual name Hamburg which actually relates to a city, not a street or road named Hamburg?

 

I'm thinking a PRXMATCH or similar is required, but there may be other possibilities

 

Many thanks

 

MR_E

 

1 ACCEPTED SOLUTION

Accepted Solutions
gamotte
Rhodochrosite | Level 12

There seems to be a large set of possible configurations and you have to figure it out first

before designing the appropriate filter.

 

The following program handles your example by using a regexp that matches all non wanted cases :

 

data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;

View solution in original post

13 REPLIES 13
LinusH
Tourmaline | Level 20
Prxmatch alone can't fix it, you still need some logic to distinguish from the other cases you have. By doing so, you would probably mimic the functionality of SAS Data Management Studio with its QKB's. So would look in that direction, especially if you have more similar tasks to solve.
Data never sleeps
MR_E
Obsidian | Level 7

Hi LinusH - thank you for the prompt response, however this is part of a large piece of logic already and will need a script to remove these instances and leave only the correct results- are there any other approaches / functions to hand that may assist with the prxmatch (or even replace)

gamotte
Rhodochrosite | Level 12

There seems to be a large set of possible configurations and you have to figure it out first

before designing the appropriate filter.

 

The following program handles your example by using a regexp that matches all non wanted cases :

 

data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;
Daniel-Santos
Obsidian | Level 7

Hi.

 

As it was already said, you'll have to identify cleary what are the rules.

 

Looking at your data example, the following would be appropriate:

 

data OUT;
        set address_data;
        if prxmatch('/[^a-z]*hamburg[^a-z]+(?!st|ro)+/i',all_address_one_variable);
run;

 

If you need more info about regular expression check here: http://www.regular-expressions.info/

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

Shmuel
Garnet | Level 18

Does your exampe of picking the city HAMMBURG in GERMANY means that

correct answer is picking a specific given city in a specific given country ?

 

Can you trust addresses being in order format: home nomber - street name - city name - country name - zipcode 

even if some parts are absent ?

 

In your example:

- filter address if ST or STREET follow city_string (ignoring case)

- accept address if country_string follows city_string

- accept if numeric zip code follows city_string

 

Try next code, check results and add rules - either to filter or to accept - more rows:

 

%let city_string = HAMBURG;

%let country_string = GERMANY;

 

data test;

 set have;

      adr = upcase(address);

      if index(adr,'ST') > index(adr,"&city_string") then delete;

      if index(adr,"&country_string") > index(adr,"&city_string") or 

         scan(adr,-1,' 0123456789') = "&city_string"

      then output;

run;

Reeza
Super User

I would look at cleaning up/standardizing the addresses first. If you can parse out the components into the valid values then you can move on from there more easily. 

ballardw
Super User

If this data has been collected since about 1980 then beatings with wet noodles are in order for whoever designed the data collection or data entry system. This problem has been around for a very long time and the basic fix is to collect data into separate fields at the start.

 

Rant over.

We see this question so often that there are entirely too many people that are not doing any planning before they collect data thatt it is scary.

Reeza
Super User

A lot of time nowadays this could be for administrative data, that's collected for another purpose and then gets used for analysis when it was never intended to, or data that gets scraped off from different sources, such as text files.

 

Food for thought - the post office has been doing text recongition and address clean up for years, most of these algorithms recreate theirs. Too bad it's not public information since public dollars does the development. 

MR_E
Obsidian | Level 7

Ballardw - beatings with wet noodles sounds strange , yet painful - i agree, the department responsible for cleansing the data beforehand have not provided us with the optimal data lake, but i can sing and dance about that all day long.... trust me i've had a rant too!

 

When you say "collect data into seperate fields at the start", if we are talking about identifying cities from a variable which contains a full address( or partial ) in one variable, and we want to determine if it is a city or a street, given all the scenarios possible, would seperating this help? i.e tokenisation?

 

 

Patrick
Opal | Level 21

Yes, tokenization is what you're after and there are data quality tools out there which could do a lot of the job for you. 

Here what SAS has on offer: https://support.sas.com/documentation/onlinedoc/dfdmstudio/

 

 

ballardw
Super User

If the data are entered/collected in a fixed order and there is something to delimit the field such as a comma similar to:

street, city, state/province then at worst you can parse by order of appearance. If the data isn't kept to some order then you start getting into something like identify the "clean" formatted data and then looking for patterns in the remaining data to process.

 

If there is no order then things could get very interesting. But if the general order above is maintained but may have a "missing" item then some searches could be done such as "is the last item a statename" (start by looking in the variables for something with the lowest number of levels). If so then search to see if the next to last is a 1) City name and 2) in the state, remainder would be street bits. If the last is not a state then search for cities. If it isn't a city then fun insues. 

 

If you have a postal code similar to the United States Zip code then the city and state/province could well be helpful in finding the bit that is a street address because our Zip will identify a city and state. Find those in the unstructed data and what is left is street.

MR_E
Obsidian | Level 7

Thank you all for the responses and suggestions, i will try some of the solutions here today and other steps before hand to see what else we can do with this.

 

It's great to have the support and knowledge of people like yourselves out there, and greatly appreciated

 

I'll update with the solution chosen later.

Shmuel
Garnet | Level 18

On second thought you better also check rejected rows:

 

%let city_string = HAMBURG;
%let country_string = GERMANY;
 
data accepted rejected;
 set have;
      adr = upcase(address);
      if index(adr,'ST') > index(adr,"&city_string") 
         then do; output rejected; return;  end;

      if index(adr,"&country_string") > index(adr,"&city_string") or 
         scan(adr,-1,' 0123456789') = "&city_string"
      then output accepted;
run;

you may find need move rows from rejected to accepted and vice versa.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1955 views
  • 7 likes
  • 8 in conversation