Re: check if variables from a list exist in a table

yabwon · Posted 01-06-2022 12:51 PM

Some ideas for performance improvement (sorry that without details, but I'm out of town and writing from mobile):
1) are you able to sort dataset with codes by "town size", i.e. a postal code from London should cover bigger number of users, so if it is at the beginning of the array it should exit the do-loop faster.
2) try to load adres column into second array (so both data and lookup will be in memory). Array index will be pointing observation number in the dataset so you can use it in second data pass.
3) split lookup into several smaller datasets and run several concurrent sessions.
4) does the postal code have some form (e.g. 12-345) ? So you could use regexp to narrow down list of addresses.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

andreas_lds · Posted 01-07-2022 01:00 AM

@MART1 wrote:

Thanks @andreas_lds

In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.

In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.

The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.

[...]

Maybe I'm just too "demanding" and this would take quite some time?

Thanks

Yes, this will take some time. So i would not use the postcodes dataset in the first step, but a regular expression to verify that the address contains (most likely) a postcode. I am not that familiar with uk postcodes, so i can't suggest a perfect expression right know, this one may work:

.*([a-z]{1,2}\d[a-z]? \d[a-z]{2}).*

In a second step i would load the postcode dataset into a hash object to verify that the result of the regex is an existing postcode.

Kurt_Bremser · Posted 01-07-2022 03:39 AM

You can try two approaches:

find a way to extract the postcode from the address field, so you can join with that value or use a hash object loaded from the lookup postcodes
load the lookup postcodes into an array and loop over it to perform a FINDW or similar function for each entry

We would need to see complete examples for your address variable, covering all structures you have in there.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 01-07-2022 05:58 AM

@MART1

Your "real world" explanation really helps to better understand the challenge. Some representative sample data provided via a SAS data step would likely get us even closer to your reality.

Below code based on what I found on the Internet how UK postcodes look like: https://ideal-postcodes.co.uk/guides/uk-postcode-format

The code syntax to extract the codes from a string based on SAS code sample from here.

Below code may not be fully suitable for your data but it should give you the necessary pointers ...and depending on your hardware the code should execute within minutes (or less).

...and I just realized that I've done things only for the Outward portion of the UK postcode. To change things for the full postcode shouldn't be that hard though. If you want SAS code for it then please provide representative sample data in the form of working data step code creating this data (the "users" and "postcode" tables).

data users;
  infile datalines truncover dlm='|';
  input user $ addr $40.;
  datalines;
aa|aaad bbbd 123 xxx ab3x blah
aa|aaad bbbd 2 xxx c9 blah
aa|aaad bbbd xc9 xxx  ab3x blah
aa|aaad bbbd c9 xxx  ab3x blah
aa|aaad bbbd c999 xxx  ab3xy blah
;

data postcodes;
  input postcode $;
  datalines;
ab3x
c9
;
data want;
  if _n_=1 then
    do;
      if 0 then
        set postcodes(keep=postcode);
      dcl hash h1(dataset:'postcodes');
      h1.defineKey('postcode');
      h1.defineDone();
    end;
  call missing(of _all_);

  set users;
  array postcode_candidate_ {4} $4;
  length _found $4;

  _ExpressionID = prxparse('/\b[[:alpha:]]{2}\d[[:alpha:]]\b|\b[[:alpha:]]\d[[:alpha:]]\b|\b[[:alpha:]]\d\b|\b[[:alpha:]]\d\d\b|\b[[:alpha:]]{2}\d\b|\b[[:alpha:]]{2}\d\d\b/');
  _start = 1;
  _stop = length(addr);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
  do while (_position > 0);
    _found = substr(addr, _position, _length);
    if h1.check(key:_found)=0 then
      do;
        _i=sum(_i,1);
        postcode_candidate_[_i]=_found;
      end;
    call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
  end;
  drop _:;
run;