BookmarkSubscribeRSS Feed
sasmaverick
Obsidian | Level 7

I am new to regular expressions in SAS and  trying to use them for cleaning a few variables in a dataset. The variables are Company_Name and Address. The requirements for company name are as follows:

-Should not have special characters (except hyphen)

-Should not have country names embedded in it e.g. Hyundai USA Inc.

-Should not have any irrelevant text like 'UNKNOWN', 'TBD' etc.---This I can add to the code as per requirements

-Should not begin with a number

-Should be allowed to include words like LTD, Corp, Inc. and so on.

Ideally the cleaned data should look like "Test Company Inc." or "Test Company Corp".


Need to work on similar lines for addresses too.


I know this is too much to ask for but I would really appreciate it if I can get some help with the coding part of this. Thank you

5 REPLIES 5
ballardw
Super User

Are you looking to show possible problem values or automagically edit them?

And what if the country name is part of the company name such as Air France?

And a company such as 3M is treated how?

Some ideas in: String Grouping & Matching

Reeza
Super User

Can you assume that data within one data set is "clean"? Just something to think about.

I like Fried Egg's response from this thread a while back.

Patrick
Opal | Level 21

Below code won't solve everything for you but hopefully will give you at least some inspiration.

data have;

  infile datalines truncover;

  input Company_Name $50.;

  datalines;

Hyundai USA Inc.

Hyundai USA Inc.

Hyundai-Inc.

Hyundai_Inc.

UNKNOWN Inc.

TBD

;

run;

data lookup;

  infile datalines truncover;

  input word $30.;

  datalines;

UNKNOWN

TBD

XXX

USA

;

run;

data want;

  set have;

  length Company_Name_Cleansed $50.;

  Company_Name_Cleansed=Company_Name;

  /** replace all special characters with a blank **/

  /** http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#p0s9ilagexm... **/

  Company_Name_Cleansed=prxchange('s/[^\.-[:alnum:]]/ /oi',-1,Company_Name_Cleansed);

  /** set all names with unwanted words to missing **/

  if _n_=1 then

    do;

      if 0 then set lookup;

      dcl hash h1(dataset:'lookup');

      _rc=h1.defineKey('word');

      _rc=h1.defineData('word');

      _rc=h1.defineDone();

      dcl hiter hit1('h1');

    end;

    _rc = hit1.first();

    do while (_rc = 0);

       if findw(Company_Name_Cleansed,strip(word),' ','i')>0 then

        do;

/*          call missing(Company_Name_Cleansed);*/

/*          leave;*/

          Company_Name_Cleansed=transtrn(Company_Name_Cleansed,strip(word),'');

        end;

       _rc = hit1.next();

    end;

    Company_Name_Cleansed=compbl(left(Company_Name_Cleansed));

run;

sasmaverick
Obsidian | Level 7

Hi Patrick,

Thanks for the help. Your logic is really helping me a lot. Just wondering how do I match addresses ending with digits using PRX patterns. Example. I want to flag everything ENDING with digits like "Napa Valley Street 10", "Rodeo Drive Suite 212, "Hwy 500".

Thanks again for your time.

Patrick
Opal | Level 21

Something like below could do.

data test;

  infile datalines truncover;

  input string $100.;

  if prxmatch('/\d+ *$/oi',string)> 1 then

    do;

      end_num_flg=1;

    end;

  else end_num_flg=0;

datalines;

Napa Valley Street 10

Rodeo Drive Suite 212

Hwy 500

Rodeo Drive 10 Suite

25 Hwy

some other street

;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 975 views
  • 3 likes
  • 4 in conversation