Dear all,
I expect to separate the city name with a company name based on PRXMATCH function.
for example,
data HAVE;
input name &:$300.;
infile datalines missover;
datalines;
AAA LTD., NEW YORK
AAA LTD, NEW YORK
AAA L.T.D. LONDON
AAA LTD. CO., TOKYO
AAA CO.,LTD. TOKYO
;
run;
and get the following result.
name | name_B | address |
AAA LTD., NEW YORK | AAA LTD | NEW YORK |
AAA LTD, NEW YORK | AAA LTD | NEW YORK |
AAA L.T.D. LONDON | AAA LTD | LONDON |
AAA LTD. CO., TOKYO | AAA LTD CO | TOKYO |
AAA CO.,LTD. TOKYO | AAA CO LTD | TOKYO |
Could you please give me some suggestions about this?
thanks in advance.
The comma is bothering you because you didn't clean the string properly.
Maybe this:
data T;
input NAME_S22 :&$200.;
%* Clean company;
NAME_S221=prxchange('s/ [^\w\d]*C[ .]*O(MPA[NM]Y)?[^\w\d]*?([ )])/ CO\2/',-1,NAME_S22 );
%* Clean LTD;
NAME_S221=prxchange('s/ [^\w\d]*L[ .]*T[ .]*D[^\w\d]*([ )])/ LTD\1/' ,-1,NAME_S221);
%* Clean Limited;
NAME_S221=prxchange('s/ [^\w\d]*LIMITED[^\w\d]*([ )])/ LTD\1/' ,-1,NAME_S221);
%* Clean CO LTD;
NAME_S221=prxchange('s/ (LTD[^\w\d]+CO|CO[^\w\d]+LTD)([ )])/ CO LTD\2/' ,-1,NAME_S221);
NAME_S221=compbl(name_S221);
cards;
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
run;
NAME_S221 |
---|
ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) |
Dear all,
I expect to create a list of company suffix (such as 'LTD', 'CO.,','COMPANY') and then separate the 'name' variable into two new variables, the first is 'name_B' variable and the second named as 'address'.
The 'address' variable is the strings following the company suffix. If the 'name' variable includes two different company suffix( such as 'CO,' and 'LTD', two observations should be created.
for example for table A
no | name |
1 | AAA LTD., NEW YORK |
2 | AAA LTD, NEW YORK |
3 | AAA LTD. LONDON |
4 | AAA LTD. CO., TOKYO |
5 | AAA CO.,LTD. TOKYO |
I expect to get
no | name | name_B | address |
1 | AAA LTD., NEW YORK | AAA LTD | NEW YORK |
2 | AAA LTD, NEW YORK | AAA LTD | NEW YORK |
3 | AAA LTD. LONDON | AAA LTD | LONDON |
4 | AAA LTD. CO., TOKYO | AAA LTD | CO., TOKYO |
4 | AAA LTD. CO., TOKYO | AAA LTD. CO | TOKYO |
5 | AAA CO.,LTD. TOKYO | AAA CO | LTD. TOKYO |
5 | AAA CO.,LTD. TOKYO | AAA CO.,LTD | TOKYO |
Could you please give me some suggestion about this? thank in advance.
Dear CHrisNZ,
thanks for your suggestion.your code works well for this question. however, it works not very well when it was run in my real data, it is a large dataset, which includes a lot of different company suffixes(such as INC, PLC), punctations (such as (3M CO)). so I think I should process them step by step. and the first step of them is to just split the city name from the old variable. That is why I post a new question.
they are actually not the same, because, for the new question, I just expect to separate them one by one. Could you please give me some suggestion about this? thanks very much.
The first step should be to clean your data: standardise the case, abbreviations, punctuation, spacing, brackets, etc.
Parsing comes next.
Thanks @ChrisNZ
I really appreciate your code and running it like below,
%MACRO CompnayNameM(no=,Company_name=);
data step9.Patstat_total_hrm_Step22;
Set step9.Patstat_total_hrm_Step22;
&Company_name._S22=&Company_name._S21;
&Company_name._S22=cat(' ', &Company_name._S22,' ');
&Company_name._S221=prxchange('s/( LTD[^\w\d]+CO[^\w\d]* | CO[^\w\d]+LTD[^\w\d]* )/ CO LTD /',1,&Company_name._S221 );
&Company_name._S221=prxchange('s/( ,?L ?\.? ?T ?\.? ?D ?\.? ?,? | [^\w\d]*LIMITED[^\w\d]* | [^\w\d]*LTD[^\w\d]* )/ LTD /',1,&Company_name._S221 );
&Company_name._S221=prxchange('s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA(N|M)Y[^\w\d]* ) / CO /',1,&Company_name._S221);
&Company_name._S221=prxchange('s/( LTD +CO | CO +LTD )/ CO LTD /',1,&Company_name._S221);
if &Company_name._S221 ne &Company_name._S22 then do;
&Company_name._B=prxchange('s/(.*)( CO LTD | LTD | CO )(.*)/$1$2/',1,&Company_name._S221);
&Company_name._address=prxchange('s/(.*)( CO LTD | LTD | CO ) *(.*)/$3/',1,&Company_name._S221);
end;
run;
%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
Run;
however, the value like '3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)' is converted to '3M COMPANY (MINNESOTA MINING AND MANUFACTURING CO'.
is it possible to keep the strings includes (), [], {} 1 following the previous string, 2 not been processed. in this step?
for '3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)', I just expect to get '3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)' rather than split to a new variable
however, for 'ACE SPORTS LIMITED TAIWAN BRANCH (B.V.I.)' , I expect to get 'ACE SPORTS LTD' and split 'TAIWAN BRANCH (B.V.I.)' in a new variable named &Company_name._address.
Could you please give me some suggestions about this?
edit;
besides, for value ' ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM', I get ' ARCH TIMBER PROTECTION LTD A PRIVATE LIMITED CO' and 'ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM', however, I expect to have ' ARCH TIMBER PROTECTION LTD' and 'A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM' but I know it is relatively difficult to process. so I think I can do it step by step.
In conclusion, I expect to separate the 'name' variable into two new variables, the first is 'name_B' variable and the second named as 'address'.
1.standardise the company suffix (which have been done by your code) and then 2.split the strings based on the standardised suffixes, If the 'name' variable includes two different company suffix( such as 'CO,' and 'LTD', two observations should be created. the strings among (), [], {}, ' ', " " should be ignored and just following the string before the '(','[','{'," ' ", ' " '.
for example, for original data,
NAME |
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY) |
ACE SPORTS LIMITED TAIWAN BRANCH (B.V.I.) |
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
I expec to do
step1 |
NAME |
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) |
ACE SPORTS LTD TAIWAN BRANCH (B.V.I.) |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
step2 | ||
NAME | name_B | name_address |
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) | 3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) | |
ACE SPORTS LTD TAIWAN BRANCH (B.V.I.) | ACE SPORTS LTD | TAIWAN BRANCH (B.V.I.) |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LTD | A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD | ARCH TIMBER PROTECTION LTD A PRIVATE LTD | CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO | ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
Could you please give me some suggestions about how to do it? thanks very much
Have you looked at what these regular expression do?
You have been working on that data for weeks now, and show no sign of having learnt from the work of the experts who helped you.
You can't expect strangers to come to your rescue each time a new element appears in a string.
We give you the tools to learn, not to replace you and do your job.
Do you now understand what these regular expression do?
dear @ChrisNZ
appreciate for your reply. yes, I understand what these regular expressions do, and I also rewrite some of them to fit them better for my data. for example, I use '[^\w\d]*' instead of ' ,?L ?\.? ?' in order to represent more punctations. as I have to process large data. however, the ')' is included in the '[^\w\d]*' as well. I always use Online regular expressions to try the code by myself or check the previous topic before posting the questions. I am a beginner of SAS and learn SAS several months by myself. if my question is stupid, I apologise for it.
Glad to see you learn. Good on you. 🙂
> however, the ')' in included in the '[^\w\d]*' as well.
You can exclude ) if you want by using [^\w\d)] instead.
Dear @ChrisNZ
thanks for your reply.
Could you please have a look at my previous question if you are convenient?
the '[^\w\d\)]*' is able to exclude the ')' but I cannot get what I need. Could you please give me more suggestions?
For the 3M string, you change the second instance of string COMPANY instead of the first one because you probably have an error: you demand 2 spaces.
Change
s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA[NM]Y[^\w\d]* ) / CO /
to
s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA[NM]Y[^\w\d]* )/ CO /
Also note that you could use function compbl() instead of a regular expression on the next line..
Dear @ChrisNZ
yes, it works well.
however, I could not keep the brackets and strings among brackets following the previous string. for example.
'3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)', I expect to have
NAME | name_B | name_address |
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) | 3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) |
Could you please give me some suggestion about this?
What's the rule to keep substrings?
Keep it if comes in parentheses after CO or LTD?
In this case maybe
NAME_B=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))?)(.*)/$1$2/',1,NAME);
instead of
NAME_B=prxchange('s/(.*)( CO LTD | LTD | CO )(.*)/$1$2/',1,NAME);
Dear @ChrisNZ
appreciate your help very much.
may I ask two more question, please?
For the value 'ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM'
I expect to get
NAME | name_B | name_address |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LTD | A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
however,
%MACRO CompnayNameM(no=,Company_name=);
data have;
Set have;
name_S22=name;
name_S22=cat(' ', name_S22,' ');
name_S221=name_S22;
name_S221=prxchange('s/( LTD[^\w\d]+CO[^\w\d]* | CO[^\w\d]+LTD[^\w\d]* )/ CO LTD /',-1,name_S221 );
name_S221=prxchange('s/( ,?L ?\.? ?T ?\.? ?D ?\.? ?,? | [^\w\d]*LIMITED[^\w\d]* | [^\w\d]*LTD[^\w\d]* )/ LTD /',-1,name_S221 );
name_S221=prxchange('s/( ,?C ?\.? ?O ?\.? ?,? | [^\w\d]*CO[^\w\d]* | [^\w\d]*COMPA(N|M)Y[^\w\d]* )/ CO /',-1,name_S221);
name_S221=compbl(name_S221);
name_S221=prxchange('s/( LTD CO | CO LTD )/ CO LTD /',-1,name_S221);
if name_S221 ne name_S22 then do;
name_B=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))*(\[[^]]*\])*(\{[^}]*\})*(\"[^"]*\")*)(.*)/$1$2/',1,name_S221);
name_address=prxchange('s/(.*)( CO LTD | LTD | CO (\([^)]*\))*(\[[^]]*\])*(\{[^}]*\})*(\"[^"]*\")*) *(.*)/$3/',1,name_S221);
end;
run;
%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
Run;
I only get
name | name_S22 | name_S221 | name_B | name_address |
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD |
How can I stop the process when meet the first company suffix?
Besides, the (\'[^']*\')* cannot be added in the prxchange function.
Could you please give me some suggestion about this?
The comma is bothering you because you didn't clean the string properly.
Maybe this:
data T;
input NAME_S22 :&$200.;
%* Clean company;
NAME_S221=prxchange('s/ [^\w\d]*C[ .]*O(MPA[NM]Y)?[^\w\d]*?([ )])/ CO\2/',-1,NAME_S22 );
%* Clean LTD;
NAME_S221=prxchange('s/ [^\w\d]*L[ .]*T[ .]*D[^\w\d]*([ )])/ LTD\1/' ,-1,NAME_S221);
%* Clean Limited;
NAME_S221=prxchange('s/ [^\w\d]*LIMITED[^\w\d]*([ )])/ LTD\1/' ,-1,NAME_S221);
%* Clean CO LTD;
NAME_S221=prxchange('s/ (LTD[^\w\d]+CO|CO[^\w\d]+LTD)([ )])/ CO LTD\2/' ,-1,NAME_S221);
NAME_S221=compbl(name_S221);
cards;
ARCH TIMBER PROTECTION LIMITED, A PRIVATE LIMITED COMPANY ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
3M COMPANY (MINNESOTA MINING AND MANUFACTURING COMPANY)
run;
NAME_S221 |
---|
ARCH TIMBER PROTECTION LTD A PRIVATE CO LTD ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) |
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.