Dear all,
How can I split strings based on a list of specific strings (i.e., LTD, CO, LTD CO) and stop the process when first meeting them.
during the process,
1.I expect not to process brackets( i.e., (), [], {}, ' ', " ") and strings among brackets during the process.
2.split the substrings to a new variable 'NAME_address'
for example,
data HAVE;
input NAME: &$500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
run;
I expect to get the variable 'NAME_B' and 'NAME_address' from variable 'NAME'
NAME | NAME_B | NAME_address |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) | 3M CO (MINNESOTA MINING AND MANUFACTURING CO) | |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' | 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' | |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" | 3M CO "MINNESOTA MINING AND MANUFACTURING CO" | |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK | 3M CO (MINNESOTA MINING AND MANUFACTURING CO) | A LTD IN UK |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO | 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' | A UK CO |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD | 3M CO "MINNESOTA MINING AND MANUFACTURING CO" | A UK LTD |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM | ARCH TIMBER PROTECTION LTD | A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
Could you please give me some suggestions about this? thanks in advance.
This should get you ahead.
data HAVE;
input NAME :& $500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
run;
data WANT;
set HAVE;
array SUFFIX[22] $6 _temporary_(
'LTD '
'CO ' 'CO LTD'
'INC '
'PLC '
'SPA '
'LLP '
'LLC '
'LC '
'AB '
'AG '
'SA '
'SAS '
'SAL '
'SPA '
'SRO '
'SRL '
'GMBH '
'KK '
'BV '
'CP '
'NL '
) ;
do I=1 to 26;
STR= '/(.*? '||trim(SUFFIX[I])||'( \(.*\)| ".*"| ''.*'')? )/';
REGEX= prxparse(STR);
call prxsubstr(REGEX,NAME,POS,LEN);
if POS then do;
SUBSTR1=substr(NAME,POS,LEN);
output;
leave;
end;
end;
run;
SUBSTR1 |
---|
3M CO (MINNESOTA MINING AND MANUFACTURING CO) |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD |
ARCH TIMBER PROTECTION LTD |
Before you even attempt to code for this you need to formulate the rules (exhaustive).
You show us what you want but I wouldn't know how to come up with a set of rules which also covers your last case not only for exactly your sample data.
Dear @Patrick
thanks for your reply. I expect to use the code to clean the company name.
so, I would like to separate 'NAME' variable into two new variables, the first is 'NAME_B' variable which only includes strings relevant to the company name, the second variable is 'NAME_address' variable which only includes strings relevant to address.
by using the following company name in table 1 as an example
table1 |
NAME |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD |
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM |
for value 3M CO (MINNESOTA MINING AND MANUFACTURING CO), '3M CO' is the abbreviation of 'MINNESOTA MINING AND MANUFACTURING CO', and I expect to put them together into 'NAME_B'.
for value 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO, " 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' " should belong to 'NAME_B' and 'A UK CO' is an introduction of 3M company, which I expect to split into a new variable named 'NAME_address'.
I expect to separate the 'NAME' variable based on 'CO', 'LTD' and 'LTD CO', as they are usually the ending value of a company name.
However, for value 'ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM',
'ARCH TIMBER PROTECTION LTD' is the name of the company, and 'A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM' is the introduction of the company, so I expect to split them into two different variables, which are 'NAME_B', 'NAME_address'. But they both include 'LTD' and 'LTD CO', so I expect to stop the process after getting the first company suffix ( in this example, it is 'LTD').
that is what I expect to do. Could you please give me some suggestions about this, please let me know if you have any question.
thanks in advance.
You showed us first how the desired result should look like. Now you explain us the reasons why you want the result to look this way. BUT: That still doesn't give us the RULES how to split strings to create such a result.
I assume you're after code which also works for other names and not just exactly for what you've posted.
For example for below:
for value 3M CO (MINNESOTA MINING AND MANUFACTURING CO), '3M CO' is the abbreviation of 'MINNESOTA MINING AND MANUFACTURING CO', and I expect to put them together into 'NAME_B'.
What's the rule for a computer to "know" that this is an abbreviation? Do you have somewhere a list of such abbreviations where we could look up?
I expect to separate the 'NAME' variable based on 'CO', 'LTD' and 'LTD CO', as they are usually the ending value of a company name.
"USUALLY" might give you the generic rule - but then you need clear rules for all the exceptions. Looking at your sample data it also appears that you don't want to just split before LTD or CO but also want to include some more "leading" words.
Based on your sample data a rule could be to split after the first CO immediately followed by a non-letter character. But I guess that would only be appropriate for your sample data and we would still need some exception rules for your last case and for "abbreviations".
So... Try to formulate rules which are exclusive and exhaustive and post them. Then we can help you with the coding.
dear @Patrick
thanks for your reply.
yes, it is my expectation basically.
a rule could be to split after the first CO immediately followed by a non-letter character.
except for
non-letter character.
I expect not followed by a non-letter character, but these five group characters, i.e., (), [], {}, ' ', " ", and strings among them. such as
<span class="token string"><span class="token punctuation">(</span>MINNESOTA MINING AND MANUFACTURING CO<span class="token punctuation">)<br />[MINNESOTA MINING AND MANUFACTURING CO]<br /></span><span class="token punctuation">{MINNESOTA MINING AND MANUFACTURING CO}</span><br />'MINNESOTA MINING AND MANUFACTURING CO'</span> <br />
rather than
(MINNESOTA MINING AND MANUFACTURING CO
or
(MINNESOTA MINING AND MANUFACTURING CO'
Besides, the list of company suffix except for 'CO' is
data HAVE;
input NAME: &$500.;
cards;
LTD
CO LTD
INC
PLC
SPA
LLP
LLC
LC
AB
AG
SA
SAS
SAL
SPA
SRO
SRL
GMBH
KK
BV
CP
NL
;
run;
please let me know if it is not clear.
From my point of view YOU need to inspect your data and then actually fully define the rules you need implemented (=clearly and explicitly formulated "black and white" rules and not "could be").
Create representative sample data (have) - best using your real data - and define the outcome. The first test of your rules is then just going over your sample have data (eventually take a 2nd and different sample from your actual data) and determine if the rules would work to get to the desired outcome (a "desk" test without writing any code). Only when satisfied with the "desk" test actual program design and coding should start.
Sooo... I suggest you try and get to the point where you've got fully formulated and tested rules.
If you then want coding help provide representative sample data (have) required to implement and unit test these rules via SAS code.
IF you don't define and verify the rules upfront then what's likely going to happen: You will get SAS code from us which does the things you've told us already for the data you've provided; but then when you use this code for your real data there will always be additional cases and you will continue to ask for changes and other stuff. That's why spending the time for proper analysis and rule formulation will get you faster to the desired outcome.
Just to give you an example for a rule:
1. First occurrence of word CO <or here reference to list of word> immediately followed by any of the following characters <list of characters>: Split string, assign everything after the list of characters to <2nd variable>
Also consider that rules might be hierarchical like: only apply 2nd rule if first rule doesn't apply. You will also have to define this hierarchy.
This should get you ahead.
data HAVE;
input NAME :& $500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM
run;
data WANT;
set HAVE;
array SUFFIX[22] $6 _temporary_(
'LTD '
'CO ' 'CO LTD'
'INC '
'PLC '
'SPA '
'LLP '
'LLC '
'LC '
'AB '
'AG '
'SA '
'SAS '
'SAL '
'SPA '
'SRO '
'SRL '
'GMBH '
'KK '
'BV '
'CP '
'NL '
) ;
do I=1 to 26;
STR= '/(.*? '||trim(SUFFIX[I])||'( \(.*\)| ".*"| ''.*'')? )/';
REGEX= prxparse(STR);
call prxsubstr(REGEX,NAME,POS,LEN);
if POS then do;
SUBSTR1=substr(NAME,POS,LEN);
output;
leave;
end;
end;
run;
SUBSTR1 |
---|
3M CO (MINNESOTA MINING AND MANUFACTURING CO) |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" |
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD |
ARCH TIMBER PROTECTION LTD |
Hello @ChrisNZ,
thanks for your amazing codes. it is very helpful for my work. however, the code does not fit the following strings.
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
which finally get.
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD |
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO |
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD |
How can I only process the first string which in the list?
Could you please give me some suggestions about this?
If you swap the positions of LTD and CO in
array SUFFIX[22] $6 _temporary_(
'LTD '
'CO '
it will work for these companies.
See:
SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO
This code was given as a template, not a solution.
You have to modify it to match your data and your needs.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.