Solved: separate the city with a company name based on PRXMATCH function

Alexxxxxxx · Posted 05-06-2019 03:28 PM

Dear all,

I expect to separate the city name with a company name based on PRXMATCH function.

for example,

data HAVE;
input name &:$300.;
infile datalines missover;
datalines;
AAA LTD., NEW YORK
AAA LTD, NEW YORK
AAA L.T.D. LONDON
AAA LTD. CO., TOKYO
AAA CO.,LTD. TOKYO
;
run;

and get the following result.

name	name_B	address
AAA LTD., NEW YORK	AAA LTD	NEW YORK
AAA LTD, NEW YORK	AAA LTD	NEW YORK
AAA L.T.D. LONDON	AAA LTD	LONDON
AAA LTD. CO., TOKYO	AAA LTD CO	TOKYO
AAA CO.,LTD. TOKYO	AAA CO LTD	TOKYO

Could you please give me some suggestions about this?

thanks in advance.

ChrisNZ · Posted 05-10-2019 12:59 AM

The comma is bothering you because you didn't clean the string properly.

Maybe this:

data T;
  input NAME_S22 :&$200.;
  %* Clean company;
  NAME_S221=prxchange('s/ [^\w\d]*C[ .]*O(MPA[NM]Y)?[^\w\d]*?([ )])/ CO\2/',-1,NAME_S22 );
  %* Clean LTD;
  NAME_S221=prxchange('s/ [^\w\d]*L[ .]*T[ .]*D[^\w\d]*([ )])/ LTD\1/'     ,-1,NAME_S221);
  %* Clean Limited;
  NAME_S221=prxchange('s/ [^\w\d]*LIMITED[^\w\d]*([ )])/ LTD\1/'           ,-1,NAME_S221);
  %* Clean CO LTD;
  NAME_S221=prxchange('s/ (LTD[^\w\d]+CO|CO[^\w\d]+LTD)([ )])/ CO LTD\2/'  ,-1,NAME_S221);
  NAME_S221=compbl(name_S221);
cards;
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
run;

NAME_S221
ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M CO (MINNESOTA MINING AND MANUFACTURING CO)

High-Performance SAS Coding - Third Edition

View solution in original post

Alexxxxxxx · Posted 05-06-2019 11:14 PM

Dear all,

I expect to create a list of company suffix (such as 'LTD', 'CO.,','COMPANY') and then separate the 'name' variable into two new variables, the first is 'name_B' variable and the second named as 'address'.

The 'address' variable is the strings following the company suffix. If the 'name' variable includes two different company suffix( such as 'CO,' and 'LTD', two observations should be created.

for example for table A

no	name
1	AAA LTD., NEW YORK
2	AAA LTD, NEW YORK
3	AAA LTD. LONDON
4	AAA LTD. CO., TOKYO
5	AAA CO.,LTD. TOKYO

I expect to get

no	name	name_B	address
1	AAA LTD., NEW YORK	AAA LTD	NEW YORK
2	AAA LTD, NEW YORK	AAA LTD	NEW YORK
3	AAA LTD. LONDON	AAA LTD	LONDON
4	AAA LTD. CO., TOKYO	AAA LTD	CO., TOKYO
4	AAA LTD. CO., TOKYO	AAA LTD. CO	TOKYO
5	AAA CO.,LTD. TOKYO	AAA CO	LTD. TOKYO
5	AAA CO.,LTD. TOKYO	AAA CO.,LTD	TOKYO

Could you please give me some suggestion about this? thank in advance.

ChrisNZ · Posted 05-07-2019 12:25 AM

Do not ask the same question twice.

Marked as spam.

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-07-2019 07:10 PM

Dear CHrisNZ,

thanks for your suggestion.your code works well for this question. however, it works not very well when it was run in my real data, it is a large dataset, which includes a lot of different company suffixes(such as INC, PLC), punctations (such as (3M CO)). so I think I should process them step by step. and the first step of them is to just split the city name from the old variable. That is why I post a new question.

they are actually not the same, because, for the new question, I just expect to separate them one by one. Could you please give me some suggestion about this? thanks very much.

ChrisNZ · Posted 05-07-2019 07:21 PM

The first step should be to clean your data: standardise the case, abbreviations, punctuation, spacing, brackets, etc.

Parsing comes next.

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-07-2019 08:03 PM

Thanks @ChrisNZ

I really appreciate your code and running it like below,

%MACRO CompnayNameM(no=,Company_name=);
data step9.Patstat_total_hrm_Step22;
Set step9.Patstat_total_hrm_Step22;
&Company_name._S22=&Company_name._S21;		
&Company_name._S22=cat(' ', &Company_name._S22,' ');	
&Company_name._S221=prxchange('s/( LTD[^\w\d]+CO[^\w\d]* | CO[^\w\d]+LTD[^\w\d]* )/ CO LTD /',1,&Company_name._S221 );
&Company_name._S221=prxchange('s/( ,?L ?\.? ?T ?\.? ?D ?\.? ?,? | [^\w\d]*LIMITED[^\w\d]* | [^\w\d]*LTD[^\w\d]* )/ LTD /',1,&Company_name._S221 );
&Company_name._S221=prxchange('s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA(N|M)Y[^\w\d]* ) / CO /',1,&Company_name._S221);
&Company_name._S221=prxchange('s/( LTD +CO | CO +LTD )/ CO LTD /',1,&Company_name._S221);

if &Company_name._S221 ne &Company_name._S22 then do;
&Company_name._B=prxchange('s/(.*)( CO LTD | LTD | CO )(.*)/$1$2/',1,&Company_name._S221);
&Company_name._address=prxchange('s/(.*)( CO LTD | LTD | CO ) *(.*)/$3/',1,&Company_name._S221);
end;
run;

%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
Run;

however, the value like '3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)' is converted to '3M COMPANY (MINNESOTA MINING AND MANUFACTURING CO'.

is it possible to keep the strings includes (), [], {} 1 following the previous string, 2 not been processed. in this step?

for '3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)', I just expect to get '3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)' rather than split to a new variable

however, for 'ACE SPORTS LIMITED TAIWAN BRANCH (B.V.I.)' , I expect to get 'ACE SPORTS LTD' and split 'TAIWAN BRANCH (B.V.I.)' in a new variable named &Company_name._address.

Could you please give me some suggestions about this?

edit;

besides, for value ' ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM', I get ' ARCH TIMBER PROTECTION LTD A PRIVATE LIMITED CO' and 'ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM', however, I expect to have ' ARCH TIMBER PROTECTION LTD' and 'A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM' but I know it is relatively difficult to process. so I think I can do it step by step.

In conclusion, I expect to separate the 'name' variable into two new variables, the first is 'name_B' variable and the second named as 'address'.

1.standardise the company suffix (which have been done by your code) and then 2.split the strings based on the standardised suffixes, If the 'name' variable includes two different company suffix( such as 'CO,' and 'LTD', two observations should be created. the strings among (), [], {}, ' ', " " should be ignored and just following the string before the '(','[','{'," ' ", ' " '.

for example, for original data,

NAME

3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)

ACE SPORTS LIMITED TAIWAN BRANCH (B.V.I.)

ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

I expec to do

step1

NAME

3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)

ACE SPORTS LTD TAIWAN BRANCH (B.V.I.)

ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

step2
NAME	name_B	name_address
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)	3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)
ACE SPORTS LTD TAIWAN BRANCH (B.V.I.)	ACE SPORTS LTD	TAIWAN BRANCH (B.V.I.)
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LTD	A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
ARCH TIMBER PROTECTION LTD A PRIVATE LTD	ARCH TIMBER PROTECTION LTD A PRIVATE LTD	CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO	ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

Could you please give me some suggestions about how to do it? thanks very much

ChrisNZ · Posted 05-07-2019 11:04 PM

Have you looked at what these regular expression do?

You have been working on that data for weeks now, and show no sign of having learnt from the work of the experts who helped you.

You can't expect strangers to come to your rescue each time a new element appears in a string.

We give you the tools to learn, not to replace you and do your job.

Do you now understand what these regular expression do?

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-07-2019 11:17 PM

dear @ChrisNZ

appreciate for your reply. yes, I understand what these regular expressions do, and I also rewrite some of them to fit them better for my data. for example, I use '[^\w\d]*' instead of ' ,?L ?\.? ?' in order to represent more punctations. as I have to process large data. however, the ')' is included in the '[^\w\d]*' as well. I always use Online regular expressions to try the code by myself or check the previous topic before posting the questions. I am a beginner of SAS and learn SAS several months by myself. if my question is stupid, I apologise for it.

ChrisNZ · Posted 05-07-2019 11:22 PM

Glad to see you learn. Good on you. 🙂

> however, the ')' in included in the '[^\w\d]*' as well.

You can exclude ) if you want by using [^\w\d)] instead.

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-07-2019 11:40 PM

Dear @ChrisNZ

thanks for your reply.

Could you please have a look at my previous question if you are convenient?

the '[^\w\d\)]*' is able to exclude the ')' but I cannot get what I need. Could you please give me more suggestions?

ChrisNZ · Posted 05-08-2019 12:40 AM

For the 3M string, you change the second instance of string COMPANY instead of the first one because you probably have an error: you demand 2 spaces.

Change

s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA[NM]Y[^\w\d]* ) / CO /

to

s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA[NM]Y[^\w\d]* )/ CO /

Also note that you could use function compbl() instead of a regular expression on the next line..

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-08-2019 08:50 AM

Dear @ChrisNZ

yes, it works well.

however, I could not keep the brackets and strings among brackets following the previous string. for example.

'3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)', I expect to have

NAME	name_B	name_address
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)	3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)

Could you please give me some suggestion about this?

ChrisNZ · Posted 05-08-2019 06:59 PM

What's the rule to keep substrings?

Keep it if comes in parentheses after CO or LTD?

In this case maybe

NAME_B=prxchange('s/(.*)( CO LTD | LTD | CO ($[^)]*$)?)(.*)/$1$2/',1,NAME);

instead of

NAME_B=prxchange('s/(.*)( CO LTD | LTD | CO )(.*)/$1$2/',1,NAME);

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-09-2019 04:34 AM

Dear @ChrisNZ

appreciate your help very much.

may I ask two more question, please?

For the value 'ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM'

I expect to get

NAME	name_B	name_address
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LTD	A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

however,

%MACRO CompnayNameM(no=,Company_name=);
data have;
Set have;
name_S22=name;		
name_S22=cat(' ', name_S22,' ');	
name_S221=name_S22;
name_S221=prxchange('s/( LTD[^\w\d]+CO[^\w\d]* | CO[^\w\d]+LTD[^\w\d]* )/ CO LTD /',-1,name_S221 );
name_S221=prxchange('s/( ,?L ?\.? ?T ?\.? ?D ?\.? ?,? | [^\w\d]*LIMITED[^\w\d]* | [^\w\d]*LTD[^\w\d]* )/ LTD /',-1,name_S221 );
name_S221=prxchange('s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA(N|M)Y[^\w\d]* )/ CO /',-1,name_S221);
name_S221=compbl(name_S221);
name_S221=prxchange('s/( LTD CO | CO LTD )/ CO LTD /',-1,name_S221);

if name_S221 ne name_S22 then do;
name_B=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))*(\[[^]]*\])*(\{[^}]*\})*(\"[^"]*\")*)(.*)/$1$2/',1,name_S221);
name_address=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))*(\[[^]]*\])*(\{[^}]*\})*(\"[^"]*\")*) *(.*)/$3/',1,name_S221);
end;
run;

%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
Run;

I only get

name	name_S22	name_S221	name_B	name_address
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD

How can I stop the process when meet the first company suffix?

Besides, the (\'[^']*\')* cannot be added in the prxchange function.

Could you please give me some suggestion about this?

ChrisNZ · Posted 05-10-2019 12:59 AM

The comma is bothering you because you didn't clean the string properly.

Maybe this:

data T;
  input NAME_S22 :&$200.;
  %* Clean company;
  NAME_S221=prxchange('s/ [^\w\d]*C[ .]*O(MPA[NM]Y)?[^\w\d]*?([ )])/ CO\2/',-1,NAME_S22 );
  %* Clean LTD;
  NAME_S221=prxchange('s/ [^\w\d]*L[ .]*T[ .]*D[^\w\d]*([ )])/ LTD\1/'     ,-1,NAME_S221);
  %* Clean Limited;
  NAME_S221=prxchange('s/ [^\w\d]*LIMITED[^\w\d]*([ )])/ LTD\1/'           ,-1,NAME_S221);
  %* Clean CO LTD;
  NAME_S221=prxchange('s/ (LTD[^\w\d]+CO|CO[^\w\d]+LTD)([ )])/ CO LTD\2/'  ,-1,NAME_S221);
  NAME_S221=compbl(name_S221);
cards;
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
run;

NAME_S221
ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M CO (MINNESOTA MINING AND MANUFACTURING CO)

High-Performance SAS Coding - Third Edition

separate the city with a company name based on PRXMATCH function

Re: separate the 'name' variable based on a list of company suffix

separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

Re: separate the 'name' variable based on a list of company suffix

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away