Help using Base SAS procedures

Regular Expressions/PRX Patterns

Reply
Contributor
Posts: 64

Regular Expressions/PRX Patterns

I am new to regular expressions in SAS and  trying to use them for cleaning a few variables in a dataset. The variables are Company_Name and Address. The requirements for company name are as follows:

-Should not have special characters (except hyphen)

-Should not have country names embedded in it e.g. Hyundai USA Inc.

-Should not have any irrelevant text like 'UNKNOWN', 'TBD' etc.---This I can add to the code as per requirements

-Should not begin with a number

-Should be allowed to include words like LTD, Corp, Inc. and so on.

Ideally the cleaned data should look like "Test Company Inc." or "Test Company Corp".


Need to work on similar lines for addresses too.


I know this is too much to ask for but I would really appreciate it if I can get some help with the coding part of this. Thank you

Super User
Posts: 10,500

Re: Regular Expressions/PRX Patterns

Are you looking to show possible problem values or automagically edit them?

And what if the country name is part of the company name such as Air France?

And a company such as 3M is treated how?

Some ideas in: String Grouping & Matching

Super User
Posts: 17,829

Re: Regular Expressions/PRX Patterns

Can you assume that data within one data set is "clean"? Just something to think about.

I like Fried Egg's response from this thread a while back.

Respected Advisor
Posts: 3,892

Re: Regular Expressions/PRX Patterns

Below code won't solve everything for you but hopefully will give you at least some inspiration.

data have;

  infile datalines truncover;

  input Company_Name $50.;

  datalines;

Hyundai USA Inc.

Hyundai USA Inc.

Hyundai-Inc.

Hyundai_Inc.

UNKNOWN Inc.

TBD

;

run;

data lookup;

  infile datalines truncover;

  input word $30.;

  datalines;

UNKNOWN

TBD

XXX

USA

;

run;

data want;

  set have;

  length Company_Name_Cleansed $50.;

  Company_Name_Cleansed=Company_Name;

  /** replace all special characters with a blank **/

  /** http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#p0s9ilagexm... **/

  Company_Name_Cleansed=prxchange('s/[^\.-[:alnum:]]/ /oi',-1,Company_Name_Cleansed);

  /** set all names with unwanted words to missing **/

  if _n_=1 then

    do;

      if 0 then set lookup;

      dcl hash h1(dataset:'lookup');

      _rc=h1.defineKey('word');

      _rc=h1.defineData('word');

      _rc=h1.defineDone();

      dcl hiter hit1('h1');

    end;

    _rc = hit1.first();

    do while (_rc = 0);

       if findw(Company_Name_Cleansed,strip(word),' ','i')>0 then

        do;

/*          call missing(Company_Name_Cleansed);*/

/*          leave;*/

          Company_Name_Cleansed=transtrn(Company_Name_Cleansed,strip(word),'');

        end;

       _rc = hit1.next();

    end;

    Company_Name_Cleansed=compbl(left(Company_Name_Cleansed));

run;

Contributor
Posts: 64

Re: Regular Expressions/PRX Patterns

Hi Patrick,

Thanks for the help. Your logic is really helping me a lot. Just wondering how do I match addresses ending with digits using PRX patterns. Example. I want to flag everything ENDING with digits like "Napa Valley Street 10", "Rodeo Drive Suite 212, "Hwy 500".

Thanks again for your time.

Respected Advisor
Posts: 3,892

Re: Regular Expressions/PRX Patterns

Something like below could do.

data test;

  infile datalines truncover;

  input string $100.;

  if prxmatch('/\d+ *$/oi',string)> 1 then

    do;

      end_num_flg=1;

    end;

  else end_num_flg=0;

datalines;

Napa Valley Street 10

Rodeo Drive Suite 212

Hwy 500

Rodeo Drive 10 Suite

25 Hwy

some other street

;

run;

Ask a Question
Discussion stats
  • 5 replies
  • 253 views
  • 3 likes
  • 4 in conversation