Afternoon All,
I have a situation where i need to remove false positives but keep the correct answers only, in this similar example - the City Hamburg.
DATA address_data;
input all_address_one_variable $char38.;
datalines;
23 HAMBURGER ST (a)
HAMBURG Germany (b)
100 Hamburg Street, Netherlands (c)
666 schaufhassenstrasse, Hamburg 68801 (d)
12 hamburg st, England (e)
147 Hamburg Road Transylvania (f)
180 Hamburg-Germany; (g)
run;
The optimal output would be to show only b,d and g ONLY.
So the Question is this - how can i select those only where it is the actual name Hamburg which actually relates to a city, not a street or road named Hamburg?
I'm thinking a PRXMATCH or similar is required, but there may be other possibilities
Many thanks
MR_E
There seems to be a large set of possible configurations and you have to figure it out first
before designing the appropriate filter.
The following program handles your example by using a regexp that matches all non wanted cases :
data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;
Hi LinusH - thank you for the prompt response, however this is part of a large piece of logic already and will need a script to remove these instances and leave only the correct results- are there any other approaches / functions to hand that may assist with the prxmatch (or even replace)
There seems to be a large set of possible configurations and you have to figure it out first
before designing the appropriate filter.
The following program handles your example by using a regexp that matches all non wanted cases :
data want;
set address_data;
if not prxmatch("/[a-z]*hamburg([a-z]| st| ro| av)/i",all_address_one_variable);
run;
Hi.
As it was already said, you'll have to identify cleary what are the rules.
Looking at your data example, the following would be appropriate:
data OUT;
set address_data;
if prxmatch('/[^a-z]*hamburg[^a-z]+(?!st|ro)+/i',all_address_one_variable);
run;
If you need more info about regular expression check here: http://www.regular-expressions.info/
Hope it helps.
Daniel Santos @ www.cgd.pt
Does your exampe of picking the city HAMMBURG in GERMANY means that
correct answer is picking a specific given city in a specific given country ?
Can you trust addresses being in order format: home nomber - street name - city name - country name - zipcode
even if some parts are absent ?
In your example:
- filter address if ST or STREET follow city_string (ignoring case)
- accept address if country_string follows city_string
- accept if numeric zip code follows city_string
Try next code, check results and add rules - either to filter or to accept - more rows:
%let city_string = HAMBURG;
%let country_string = GERMANY;
data test;
set have;
adr = upcase(address);
if index(adr,'ST') > index(adr,"&city_string") then delete;
if index(adr,"&country_string") > index(adr,"&city_string") or
scan(adr,-1,' 0123456789') = "&city_string"
then output;
run;
I would look at cleaning up/standardizing the addresses first. If you can parse out the components into the valid values then you can move on from there more easily.
If this data has been collected since about 1980 then beatings with wet noodles are in order for whoever designed the data collection or data entry system. This problem has been around for a very long time and the basic fix is to collect data into separate fields at the start.
Rant over.
We see this question so often that there are entirely too many people that are not doing any planning before they collect data thatt it is scary.
A lot of time nowadays this could be for administrative data, that's collected for another purpose and then gets used for analysis when it was never intended to, or data that gets scraped off from different sources, such as text files.
Food for thought - the post office has been doing text recongition and address clean up for years, most of these algorithms recreate theirs. Too bad it's not public information since public dollars does the development.
Ballardw - beatings with wet noodles sounds strange , yet painful - i agree, the department responsible for cleansing the data beforehand have not provided us with the optimal data lake, but i can sing and dance about that all day long.... trust me i've had a rant too!
When you say "collect data into seperate fields at the start", if we are talking about identifying cities from a variable which contains a full address( or partial ) in one variable, and we want to determine if it is a city or a street, given all the scenarios possible, would seperating this help? i.e tokenisation?
Yes, tokenization is what you're after and there are data quality tools out there which could do a lot of the job for you.
Here what SAS has on offer: https://support.sas.com/documentation/onlinedoc/dfdmstudio/
If the data are entered/collected in a fixed order and there is something to delimit the field such as a comma similar to:
street, city, state/province then at worst you can parse by order of appearance. If the data isn't kept to some order then you start getting into something like identify the "clean" formatted data and then looking for patterns in the remaining data to process.
If there is no order then things could get very interesting. But if the general order above is maintained but may have a "missing" item then some searches could be done such as "is the last item a statename" (start by looking in the variables for something with the lowest number of levels). If so then search to see if the next to last is a 1) City name and 2) in the state, remainder would be street bits. If the last is not a state then search for cities. If it isn't a city then fun insues.
If you have a postal code similar to the United States Zip code then the city and state/province could well be helpful in finding the bit that is a street address because our Zip will identify a city and state. Find those in the unstructed data and what is left is street.
Thank you all for the responses and suggestions, i will try some of the solutions here today and other steps before hand to see what else we can do with this.
It's great to have the support and knowledge of people like yourselves out there, and greatly appreciated
I'll update with the solution chosen later.
On second thought you better also check rejected rows:
%let city_string = HAMBURG;
%let country_string = GERMANY;
data accepted rejected;
set have;
adr = upcase(address);
if index(adr,'ST') > index(adr,"&city_string")
then do; output rejected; return; end;
if index(adr,"&country_string") > index(adr,"&city_string") or
scan(adr,-1,' 0123456789') = "&city_string"
then output accepted;
run;
you may find need move rows from rejected to accepted and vice versa.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.