BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

I expect to separate the city name with a company name based on PRXMATCH function.

for example,

data HAVE;
input name &:$300.;
infile datalines missover;
datalines;
AAA LTD., NEW YORK
AAA LTD, NEW YORK
AAA L.T.D. LONDON
AAA LTD. CO., TOKYO
AAA CO.,LTD. TOKYO
;
run;

and get the following result.

namename_Baddress
AAA LTD., NEW YORKAAA LTDNEW YORK
AAA LTD, NEW YORKAAA LTDNEW YORK
AAA L.T.D. LONDONAAA LTDLONDON
AAA LTD. CO., TOKYOAAA LTD COTOKYO
AAA CO.,LTD. TOKYOAAA CO LTDTOKYO

Could you please give me some suggestions about this?

thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

The comma is bothering you because you didn't clean the string properly.

Maybe this:

data T;
  input NAME_S22 :&$200.;
  %* Clean company;
  NAME_S221=prxchange('s/ [^\w\d]*C[ .]*O(MPA[NM]Y)?[^\w\d]*?([ )])/ CO\2/',-1,NAME_S22 );
  %* Clean LTD;
  NAME_S221=prxchange('s/ [^\w\d]*L[ .]*T[ .]*D[^\w\d]*([ )])/ LTD\1/'     ,-1,NAME_S221);
  %* Clean Limited;
  NAME_S221=prxchange('s/ [^\w\d]*LIMITED[^\w\d]*([ )])/ LTD\1/'           ,-1,NAME_S221);
  %* Clean CO LTD;
  NAME_S221=prxchange('s/ (LTD[^\w\d]+CO|CO[^\w\d]+LTD)([ )])/ CO LTD\2/'  ,-1,NAME_S221);
  NAME_S221=compbl(name_S221);
cards;
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
run;

NAME_S221
ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M CO (MINNESOTA MINING AND MANUFACTURING CO)

View solution in original post

17 REPLIES 17
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

I expect to create a list of company suffix (such as 'LTD', 'CO.,','COMPANY') and then separate the 'name' variable into two new variables, the first is 'name_B' variable and the second named as 'address'.

 

The 'address' variable is the strings following the company suffix. If the 'name' variable includes two different company suffix( such as 'CO,' and 'LTD', two observations should be created.

 

for example for table A

noname
1AAA LTD., NEW YORK
2AAA LTD, NEW YORK
3AAA LTD. LONDON
4AAA LTD. CO., TOKYO
5AAA CO.,LTD. TOKYO

I expect to get

nonamename_Baddress
1AAA LTD., NEW YORKAAA LTDNEW YORK
2AAA LTD, NEW YORKAAA LTDNEW YORK
3AAA LTD. LONDONAAA LTDLONDON
4AAA LTD. CO., TOKYOAAA LTDCO., TOKYO
4AAA LTD. CO., TOKYOAAA LTD. COTOKYO
5AAA CO.,LTD. TOKYOAAA COLTD. TOKYO
5AAA CO.,LTD. TOKYOAAA CO.,LTDTOKYO

Could you please give me some suggestion about this? thank in advance.

ChrisNZ
Tourmaline | Level 20

Do not ask the same question twice.

 

Marked as spam.

Alexxxxxxx
Pyrite | Level 9

Dear CHrisNZ,

 

thanks for your suggestion.your code works well for this question. however, it works not very well when it was run in my real data, it is a large dataset, which includes a lot of different company suffixes(such as INC, PLC), punctations (such as (3M CO)). so I think I should process them step by step. and the first step of them is to just split the city name from the old variable. That is why I post a new question.

 

they are actually not the same, because, for the new question, I just expect to separate them one by one. Could you please give me some suggestion about this? thanks very much.

ChrisNZ
Tourmaline | Level 20

The first step should be to clean your data: standardise the case, abbreviations, punctuation, spacing, brackets, etc.

Parsing comes next.

 

Alexxxxxxx
Pyrite | Level 9

Thanks @ChrisNZ 

I really appreciate your code and running it like below,

 

%MACRO CompnayNameM(no=,Company_name=);
data step9.Patstat_total_hrm_Step22;
Set step9.Patstat_total_hrm_Step22;
&Company_name._S22=&Company_name._S21;		
&Company_name._S22=cat(' ', &Company_name._S22,' ');	
&Company_name._S221=prxchange('s/( LTD[^\w\d]+CO[^\w\d]* | CO[^\w\d]+LTD[^\w\d]* )/ CO LTD /',1,&Company_name._S221 );
&Company_name._S221=prxchange('s/( ,?L ?\.? ?T ?\.? ?D ?\.? ?,? | [^\w\d]*LIMITED[^\w\d]* | [^\w\d]*LTD[^\w\d]* )/ LTD /',1,&Company_name._S221 );
&Company_name._S221=prxchange('s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA(N|M)Y[^\w\d]* ) / CO /',1,&Company_name._S221);
&Company_name._S221=prxchange('s/( LTD +CO | CO +LTD )/ CO LTD /',1,&Company_name._S221);

if &Company_name._S221 ne &Company_name._S22 then do;
&Company_name._B=prxchange('s/(.*)( CO LTD | LTD | CO )(.*)/$1$2/',1,&Company_name._S221);
&Company_name._address=prxchange('s/(.*)( CO LTD | LTD | CO ) *(.*)/$3/',1,&Company_name._S221);
end;
run;

%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
Run;

however, the value like '3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)' is converted to '3M COMPANY (MINNESOTA MINING AND MANUFACTURING CO'. 

 

 

is it possible to keep the strings includes (), [], {} 1 following the previous string, 2 not been processed. in this step?

for '3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)', I just expect to get '3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)' rather than split to a new variable 

 

however, for 'ACE SPORTS LIMITED TAIWAN BRANCH (B.V.I.)' , I expect to get 'ACE SPORTS LTD' and split 'TAIWAN BRANCH (B.V.I.)' in a new variable named &Company_name._address.

Could you please give me some suggestions about this?

 

edit;

besides, for value ' ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM', I get ' ARCH TIMBER PROTECTION LTD A PRIVATE LIMITED CO' and 'ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM', however, I expect to have ' ARCH TIMBER PROTECTION LTD' and 'A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM' but I know it is relatively difficult to process. so I think I can do it step by step.

 

 

In conclusion, I expect to separate the 'name' variable into two new variables, the first is 'name_B' variable and the second named as 'address'.

 

1.standardise the company suffix (which have been done by your code) and then 2.split the strings based on the standardised suffixes, If the 'name' variable includes two different company suffix( such as 'CO,' and 'LTD', two observations should be created. the strings among (), [], {}, ' ', " " should be ignored and just following the string before the '(','[','{'," ' ", ' " '. 

 

for example, for original data,

NAME
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
ACE SPORTS LIMITED TAIWAN BRANCH (B.V.I.)
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

 

I expec to do 

step1
NAME
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)
ACE SPORTS LTD TAIWAN BRANCH (B.V.I.)
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

 

step2  

NAME

name_Bname_address
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) 
ACE SPORTS LTD TAIWAN BRANCH (B.V.I.)ACE SPORTS LTDTAIWAN BRANCH (B.V.I.)
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
ARCH TIMBER PROTECTION LTD A PRIVATE LTDARCH TIMBER PROTECTION LTD A PRIVATE LTDCO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LTD A PRIVATE LTD COORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

 

Could you please give me some suggestions about how to do it? thanks very much

 

ChrisNZ
Tourmaline | Level 20

Have you looked at what these regular expression do? 

 

You have been working on that data for weeks now, and show no sign of having learnt from the work of the experts who helped you.

You can't expect strangers to come to your rescue each time a new element appears in a string. 

We give you the tools to learn, not to replace you and do your job.

 

Do you now understand what these regular expression do? 

 

Alexxxxxxx
Pyrite | Level 9

dear @ChrisNZ 

appreciate for your reply. yes, I understand what these regular expressions do, and I also rewrite some of them to fit them better for my data. for example, I use '[^\w\d]*' instead of ' ,?L ?\.? ?' in order to represent more punctations. as I have to process large data. however, the ')' is included in the '[^\w\d]*' as well. I always use Online regular expressions to try the code by myself or check the previous topic before posting the questions. I am a beginner of SAS and learn SAS several months by myself. if my question is stupid, I apologise for it.

ChrisNZ
Tourmaline | Level 20

Glad to see you learn. Good on you. 🙂

> however, the ')' in included in the '[^\w\d]*' as well. 

You can exclude  )  if you want by using  [^\w\d)] instead.

 

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

thanks for your reply.

 

Could you please have a look at my previous question if you are convenient? 

 

the '[^\w\d\)]*' is able to exclude the ')' but I cannot get what I need. Could you please give me more suggestions?

ChrisNZ
Tourmaline | Level 20

For the 3M string, you change the second instance of string COMPANY instead of the first one because you probably have an error: you demand 2 spaces.

Change

s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA[NM]Y[^\w\d]* ) / CO /

to

s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA[NM]Y[^\w\d]* )/ CO /

 

Also note that you could use function compbl() instead of a regular expression on the next line..

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

yes, it works well.

 

however, I could not keep the brackets and strings among brackets following the previous string. for example.

 

'3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)', I expect to have 

NAME

name_Bname_address
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) 

 

Could you please give me some suggestion about this?

ChrisNZ
Tourmaline | Level 20

What's the rule to keep substrings?

Keep it if comes in parentheses after CO or LTD?

In this case maybe

NAME_B=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))?)(.*)/$1$2/',1,NAME);

instead of 

NAME_B=prxchange('s/(.*)( CO LTD | LTD | CO )(.*)/$1$2/',1,NAME);

 

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

appreciate your help very much.

 

may I ask two more question, please?

 

For the value 'ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM'

 

I expect to get

NAMEname_Bname_address
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

 

however,

%MACRO CompnayNameM(no=,Company_name=);
data have;
Set have;
name_S22=name;		
name_S22=cat(' ', name_S22,' ');	
name_S221=name_S22;
name_S221=prxchange('s/( LTD[^\w\d]+CO[^\w\d]* | CO[^\w\d]+LTD[^\w\d]* )/ CO LTD /',-1,name_S221 );
name_S221=prxchange('s/( ,?L ?\.? ?T ?\.? ?D ?\.? ?,? | [^\w\d]*LIMITED[^\w\d]* | [^\w\d]*LTD[^\w\d]* )/ LTD /',-1,name_S221 );
name_S221=prxchange('s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA(N|M)Y[^\w\d]* )/ CO /',-1,name_S221);
name_S221=compbl(name_S221);
name_S221=prxchange('s/( LTD CO | CO LTD )/ CO LTD /',-1,name_S221);

if name_S221 ne name_S22 then do;
name_B=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))*(\[[^]]*\])*(\{[^}]*\})*(\"[^"]*\")*)(.*)/$1$2/',1,name_S221);
name_address=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))*(\[[^]]*\])*(\{[^}]*\})*(\"[^"]*\")*) *(.*)/$3/',1,name_S221);
end;
run;

%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
Run;

I only get 

namename_S22name_S221name_Bname_address
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LTD A PRIVATE CO LTD 

How can I stop the process when meet the first company suffix? 

 

Besides, the  (\'[^']*\')* cannot be added in the prxchange function.

Could you please give me some suggestion about this?

 

ChrisNZ
Tourmaline | Level 20

The comma is bothering you because you didn't clean the string properly.

Maybe this:

data T;
  input NAME_S22 :&$200.;
  %* Clean company;
  NAME_S221=prxchange('s/ [^\w\d]*C[ .]*O(MPA[NM]Y)?[^\w\d]*?([ )])/ CO\2/',-1,NAME_S22 );
  %* Clean LTD;
  NAME_S221=prxchange('s/ [^\w\d]*L[ .]*T[ .]*D[^\w\d]*([ )])/ LTD\1/'     ,-1,NAME_S221);
  %* Clean Limited;
  NAME_S221=prxchange('s/ [^\w\d]*LIMITED[^\w\d]*([ )])/ LTD\1/'           ,-1,NAME_S221);
  %* Clean CO LTD;
  NAME_S221=prxchange('s/ (LTD[^\w\d]+CO|CO[^\w\d]+LTD)([ )])/ CO LTD\2/'  ,-1,NAME_S221);
  NAME_S221=compbl(name_S221);
cards;
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
run;

NAME_S221
ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M CO (MINNESOTA MINING AND MANUFACTURING CO)

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 1389 views
  • 5 likes
  • 2 in conversation