@MART1 wrote:
Thanks @andreas_lds
In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.
In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.
The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.
[...]
Maybe I'm just too "demanding" and this would take quite some time?
Thanks
Yes, this will take some time. So i would not use the postcodes dataset in the first step, but a regular expression to verify that the address contains (most likely) a postcode. I am not that familiar with uk postcodes, so i can't suggest a perfect expression right know, this one may work:
.*([a-z]{1,2}\d[a-z]? \d[a-z]{2}).*
In a second step i would load the postcode dataset into a hash object to verify that the result of the regex is an existing postcode.
You can try two approaches:
We would need to see complete examples for your address variable, covering all structures you have in there.
Your "real world" explanation really helps to better understand the challenge. Some representative sample data provided via a SAS data step would likely get us even closer to your reality.
Below code based on what I found on the Internet how UK postcodes look like: https://ideal-postcodes.co.uk/guides/uk-postcode-format
The code syntax to extract the codes from a string based on SAS code sample from here.
Below code may not be fully suitable for your data but it should give you the necessary pointers ...and depending on your hardware the code should execute within minutes (or less).
...and I just realized that I've done things only for the Outward portion of the UK postcode. To change things for the full postcode shouldn't be that hard though. If you want SAS code for it then please provide representative sample data in the form of working data step code creating this data (the "users" and "postcode" tables).
data users;
infile datalines truncover dlm='|';
input user $ addr $40.;
datalines;
aa|aaad bbbd 123 xxx ab3x blah
aa|aaad bbbd 2 xxx c9 blah
aa|aaad bbbd xc9 xxx ab3x blah
aa|aaad bbbd c9 xxx ab3x blah
aa|aaad bbbd c999 xxx ab3xy blah
;
data postcodes;
input postcode $;
datalines;
ab3x
c9
;
data want;
if _n_=1 then
do;
if 0 then
set postcodes(keep=postcode);
dcl hash h1(dataset:'postcodes');
h1.defineKey('postcode');
h1.defineDone();
end;
call missing(of _all_);
set users;
array postcode_candidate_ {4} $4;
length _found $4;
_ExpressionID = prxparse('/\b[[:alpha:]]{2}\d[[:alpha:]]\b|\b[[:alpha:]]\d[[:alpha:]]\b|\b[[:alpha:]]\d\b|\b[[:alpha:]]\d\d\b|\b[[:alpha:]]{2}\d\b|\b[[:alpha:]]{2}\d\d\b/');
_start = 1;
_stop = length(addr);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the _start parameter so that searching */
/* begins again after the last match. */
call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
do while (_position > 0);
_found = substr(addr, _position, _length);
if h1.check(key:_found)=0 then
do;
_i=sum(_i,1);
postcode_candidate_[_i]=_found;
end;
call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
end;
drop _:;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.