BookmarkSubscribeRSS Feed
yabwon
Onyx | Level 15
Some ideas for performance improvement (sorry that without details, but I'm out of town and writing from mobile):
1) are you able to sort dataset with codes by "town size", i.e. a postal code from London should cover bigger number of users, so if it is at the beginning of the array it should exit the do-loop faster.
2) try to load adres column into second array (so both data and lookup will be in memory). Array index will be pointing observation number in the dataset so you can use it in second data pass.
3) split lookup into several smaller datasets and run several concurrent sessions.
4) does the postal code have some form (e.g. 12-345) ? So you could use regexp to narrow down list of addresses.

Bart
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



andreas_lds
Jade | Level 19

@MART1 wrote:

Thanks @andreas_lds 

 

In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.

 

In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.

 

The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.

[...]

Maybe I'm just too "demanding" and this would take quite some time?

Thanks


Yes, this will take some time. So i would not use the postcodes dataset in the first step, but a regular expression to verify that the address contains (most likely) a postcode. I am not that familiar with uk postcodes, so i can't suggest a perfect expression right know, this one may work:

 

.*([a-z]{1,2}\d[a-z]? \d[a-z]{2}).*

In a second step i would load the postcode dataset into a hash object to verify that the result of the regex is an existing postcode.

 

Kurt_Bremser
Super User

You can try two approaches:

  • find a way to extract the postcode from the address field, so you can join with that value or use a hash object loaded from the lookup postcodes
  • load the lookup postcodes into an array and loop over it to perform a FINDW or similar function for each entry

We would need to see complete examples for your address variable, covering all structures you have in there.

Patrick
Opal | Level 21

@MART1 

Your "real world" explanation really helps to better understand the challenge. Some representative sample data provided via a SAS data step would likely get us even closer to your reality.

 

Below code based on what I found on the Internet how UK postcodes look like: https://ideal-postcodes.co.uk/guides/uk-postcode-format 

The code syntax to extract the codes from a string based on SAS code sample from here.

 

Below code may not be fully suitable for your data but it should give you the necessary pointers ...and depending on your hardware the code should execute within minutes (or less).

 

...and I just realized that I've done things only for the Outward portion of the UK postcode. To change things for the full postcode shouldn't be that hard though. If you want SAS code for it then please provide representative sample data in the form of working data step code creating this data (the "users" and "postcode" tables).

data users;
  infile datalines truncover dlm='|';
  input user $ addr $40.;
  datalines;
aa|aaad bbbd 123 xxx ab3x blah
aa|aaad bbbd 2 xxx c9 blah
aa|aaad bbbd xc9 xxx  ab3x blah
aa|aaad bbbd c9 xxx  ab3x blah
aa|aaad bbbd c999 xxx  ab3xy blah
;

data postcodes;
  input postcode $;
  datalines;
ab3x
c9
;
data want;
  if _n_=1 then
    do;
      if 0 then
        set postcodes(keep=postcode);
      dcl hash h1(dataset:'postcodes');
      h1.defineKey('postcode');
      h1.defineDone();
    end;
  call missing(of _all_);

  set users;
  array postcode_candidate_ {4} $4;
  length _found $4;

  _ExpressionID = prxparse('/\b[[:alpha:]]{2}\d[[:alpha:]]\b|\b[[:alpha:]]\d[[:alpha:]]\b|\b[[:alpha:]]\d\b|\b[[:alpha:]]\d\d\b|\b[[:alpha:]]{2}\d\b|\b[[:alpha:]]{2}\d\d\b/');
  _start = 1;
  _stop = length(addr);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
  do while (_position > 0);
    _found = substr(addr, _position, _length);
    if h1.check(key:_found)=0 then
      do;
        _i=sum(_i,1);
        postcode_candidate_[_i]=_found;
      end;
    call prxnext(_ExpressionID, _start, _stop, addr, _position, _length);
  end;
  drop _:;
run;

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 6664 views
  • 4 likes
  • 9 in conversation