@MART1 wrote:
Thanks @andreas_lds
In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.
In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.
The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.
[...]
Maybe I'm just too "demanding" and this would take quite some time?
Thanks
Yes, this will take some time. So i would not use the postcodes dataset in the first step, but a regular expression to verify that the address contains (most likely) a postcode. I am not that familiar with uk postcodes, so i can't suggest a perfect expression right know, this one may work:
.*([a-z]{1,2}\d[a-z]? \d[a-z]{2}).*
In a second step i would load the postcode dataset into a hash object to verify that the result of the regex is an existing postcode.
You can try two approaches:
We would need to see complete examples for your address variable, covering all structures you have in there.
Your "real world" explanation really helps to better understand the challenge. Some representative sample data provided via a SAS data step would likely get us even closer to your reality.
Below code based on what I found on the Internet how UK postcodes look like: https://ideal-postcodes.co.uk/guides/uk-postcode-format
The code syntax to extract the codes from a string based on SAS code sample from here.
Below code may not be fully suitable for your data but it should give you the necessary pointers ...and depending on your hardware the code should execute within minutes (or less).
...and I just realized that I've done things only for the Outward portion of the UK postcode. To change things for the full postcode shouldn't be that hard though. If you want SAS code for it then please provide representative sample data in the form of working data step code creating this data (the "users" and "postcode" tables).
data users;
infile datalines truncover dlm='|';
input user $ addr $40.;
datalines;
aa|aaad bbbd 123 xxx ab3x blah
aa|aaad bbbd 2 xxx c9 blah
aa|aaad bbbd xc9 xxx ab3x blah
aa|aaad bbbd c9 xxx ab3x blah
aa|aaad bbbd c999 xxx ab3xy blah
;
data postcodes;
input postcode $;
datalines;
ab3x
c9
;
data want;
if _n_=1 then
do;
if 0 then
set postcodes(keep=postcode);
dcl hash h1(dataset:'postcodes');
h1.defineKey('postcode');
h1.defineDone();
end;
call missing(of _all_);
set users;
array postcode_candidate_ {4} $4;
length _found $4;
_ExpressionID = prxparse('/\b[[:alpha:]]{2}\d[[:alpha:]]\b|\b[[:alpha:]]\d[[:alpha:]]\b|\b[[:alpha:]]\d\b|\b[[:alpha:]]\d\d\b|\b[[:alpha:]]{2}\d\b|\b[[:alpha:]]{2}\d\d\b/');
_start = 1;
_stop = length(addr);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the _start parameter so that searching */
/* begins again after the last match. */
call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
do while (_position > 0);
_found = substr(addr, _position, _length);
if h1.check(key:_found)=0 then
do;
_i=sum(_i,1);
postcode_candidate_[_i]=_found;
end;
call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
end;
drop _:;
run;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.