Solved: split strings based on a list of specific strings and stop the process...

Alexxxxxxx · Posted 05-10-2019 11:33 AM

Dear all,

How can I split strings based on a list of specific strings (i.e., LTD, CO, LTD CO) and stop the process when first meeting them.

during the process,

1.I expect not to process brackets( i.e., (), [], {}, ' ', " ") and strings among brackets during the process.

2.split the substrings to a new variable 'NAME_address'

for example,

data HAVE;
  input NAME: &$500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" 


3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD 
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM 
run;

I expect to get the variable 'NAME_B' and 'NAME_address' from variable 'NAME'

NAME	NAME_B	NAME_address
3M CO (MINNESOTA MINING AND MANUFACTURING CO)	3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'	3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"	3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK	3M CO (MINNESOTA MINING AND MANUFACTURING CO)	A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO	3M CO 'MINNESOTA MINING AND MANUFACTURING CO'	A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD	3M CO "MINNESOTA MINING AND MANUFACTURING CO"	A UK LTD
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM	ARCH TIMBER PROTECTION LTD	A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

Could you please give me some suggestions about this? thanks in advance.

ChrisNZ · Posted 05-13-2019 01:36 AM

This should get you ahead.

data HAVE;
  input NAME :& $500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" 
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD 
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM 
run;    

data WANT;
  set HAVE;
  array SUFFIX[22] $6 _temporary_( 
                        'LTD   '
                        'CO    '  'CO LTD'
                        'INC   '
                        'PLC   '
                        'SPA   '
                        'LLP   '
                        'LLC   '
                        'LC    '
                        'AB    '
                        'AG    '
                        'SA    '
                        'SAS   '
                        'SAL   '
                        'SPA   '
                        'SRO   '
                        'SRL   '
                        'GMBH  '
                        'KK    '
                        'BV    '
                        'CP    '
                        'NL    '
                        ) ;    
  do I=1 to 26;
    STR= '/(.*? '||trim(SUFFIX[I])||'( \(.*\)| ".*"| ''.*'')? )/';
    REGEX= prxparse(STR);
    call prxsubstr(REGEX,NAME,POS,LEN);
    if POS then do;
      SUBSTR1=substr(NAME,POS,LEN);
      output; 
      leave;
    end;
  end;
run;

SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD

High-Performance SAS Coding - Third Edition

View solution in original post

Patrick · Posted 05-12-2019 02:20 AM

@Alexxxxxxx

Before you even attempt to code for this you need to formulate the rules (exhaustive).

You show us what you want but I wouldn't know how to come up with a set of rules which also covers your last case not only for exactly your sample data.

Alexxxxxxx · Posted 05-12-2019 05:45 AM

Dear @Patrick

thanks for your reply. I expect to use the code to clean the company name.

so, I would like to separate 'NAME' variable into two new variables, the first is 'NAME_B' variable which only includes strings relevant to the company name, the second variable is 'NAME_address' variable which only includes strings relevant to address.

by using the following company name in table 1 as an example

table1

NAME

3M CO (MINNESOTA MINING AND MANUFACTURING CO)

3M CO 'MINNESOTA MINING AND MANUFACTURING CO'

3M CO "MINNESOTA MINING AND MANUFACTURING CO"

3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK

3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO

3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD

ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

for value 3M CO (MINNESOTA MINING AND MANUFACTURING CO), '3M CO' is the abbreviation of 'MINNESOTA MINING AND MANUFACTURING CO', and I expect to put them together into 'NAME_B'.

for value 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO, " 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' " should belong to 'NAME_B' and 'A UK CO' is an introduction of 3M company, which I expect to split into a new variable named 'NAME_address'.

I expect to separate the 'NAME' variable based on 'CO', 'LTD' and 'LTD CO', as they are usually the ending value of a company name.

However, for value 'ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM',

'ARCH TIMBER PROTECTION LTD' is the name of the company, and 'A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM' is the introduction of the company, so I expect to split them into two different variables, which are 'NAME_B', 'NAME_address'. But they both include 'LTD' and 'LTD CO', so I expect to stop the process after getting the first company suffix ( in this example, it is 'LTD').

that is what I expect to do. Could you please give me some suggestions about this, please let me know if you have any question.

thanks in advance.

Patrick · Posted 05-12-2019 07:12 PM

@Alexxxxxxx

You showed us first how the desired result should look like. Now you explain us the reasons why you want the result to look this way. BUT: That still doesn't give us the RULES how to split strings to create such a result.

I assume you're after code which also works for other names and not just exactly for what you've posted.

For example for below:

for value 3M CO (MINNESOTA MINING AND MANUFACTURING CO), '3M CO' is the abbreviation of 'MINNESOTA MINING AND MANUFACTURING CO', and I expect to put them together into 'NAME_B'.

What's the rule for a computer to "know" that this is an abbreviation? Do you have somewhere a list of such abbreviations where we could look up?

I expect to separate the 'NAME' variable based on 'CO', 'LTD' and 'LTD CO', as they are usually the ending value of a company name.

"USUALLY" might give you the generic rule - but then you need clear rules for all the exceptions. Looking at your sample data it also appears that you don't want to just split before LTD or CO but also want to include some more "leading" words.

Based on your sample data a rule could be to split after the first CO immediately followed by a non-letter character. But I guess that would only be appropriate for your sample data and we would still need some exception rules for your last case and for "abbreviations".

So... Try to formulate rules which are exclusive and exhaustive and post them. Then we can help you with the coding.

Alexxxxxxx · Posted 05-12-2019 10:54 PM

dear @Patrick

thanks for your reply.

yes, it is my expectation basically.

a rule could be to split after the first CO immediately followed by a non-letter character.

except for

non-letter character.

I expect not followed by a non-letter character, but these five group characters, i.e., (), [], {}, ' ', " ", and strings among them. such as

<span class="token string"><span class="token punctuation">(</span>MINNESOTA MINING AND MANUFACTURING CO<span class="token punctuation">)<br />[MINNESOTA MINING AND MANUFACTURING CO]<br /></span><span class="token punctuation">{MINNESOTA MINING AND MANUFACTURING CO}</span><br />'MINNESOTA MINING AND MANUFACTURING CO'</span> <br />

rather than

(MINNESOTA MINING AND MANUFACTURING CO

or

(MINNESOTA MINING AND MANUFACTURING CO'

Besides, the list of company suffix except for 'CO' is

data HAVE;
  input NAME: &$500.;
cards;
LTD
CO LTD
INC  
PLC
SPA
LLP
LLC
LC
AB
AG
SA
SAS
SAL
SPA
SRO
SRL
GMBH
KK
BV
CP
NL
;
run;

please let me know if it is not clear.

Patrick · Posted 05-13-2019 01:29 AM

@Alexxxxxxx

From my point of view YOU need to inspect your data and then actually fully define the rules you need implemented (=clearly and explicitly formulated "black and white" rules and not "could be").

Create representative sample data (have) - best using your real data - and define the outcome. The first test of your rules is then just going over your sample have data (eventually take a 2nd and different sample from your actual data) and determine if the rules would work to get to the desired outcome (a "desk" test without writing any code). Only when satisfied with the "desk" test actual program design and coding should start.

Sooo... I suggest you try and get to the point where you've got fully formulated and tested rules.

If you then want coding help provide representative sample data (have) required to implement and unit test these rules via SAS code.

IF you don't define and verify the rules upfront then what's likely going to happen: You will get SAS code from us which does the things you've told us already for the data you've provided; but then when you use this code for your real data there will always be additional cases and you will continue to ask for changes and other stuff. That's why spending the time for proper analysis and rule formulation will get you faster to the desired outcome.

Just to give you an example for a rule:

1. First occurrence of word CO <or here reference to list of word> immediately followed by any of the following characters <list of characters>: Split string, assign everything after the list of characters to <2nd variable>

Also consider that rules might be hierarchical like: only apply 2nd rule if first rule doesn't apply. You will also have to define this hierarchy.

ChrisNZ · Posted 05-13-2019 01:36 AM

This should get you ahead.

data HAVE;
  input NAME :& $500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" 
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD 
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM 
run;    

data WANT;
  set HAVE;
  array SUFFIX[22] $6 _temporary_( 
                        'LTD   '
                        'CO    '  'CO LTD'
                        'INC   '
                        'PLC   '
                        'SPA   '
                        'LLP   '
                        'LLC   '
                        'LC    '
                        'AB    '
                        'AG    '
                        'SA    '
                        'SAS   '
                        'SAL   '
                        'SPA   '
                        'SRO   '
                        'SRL   '
                        'GMBH  '
                        'KK    '
                        'BV    '
                        'CP    '
                        'NL    '
                        ) ;    
  do I=1 to 26;
    STR= '/(.*? '||trim(SUFFIX[I])||'( \(.*\)| ".*"| ''.*'')? )/';
    REGEX= prxparse(STR);
    call prxsubstr(REGEX,NAME,POS,LEN);
    if POS then do;
      SUBSTR1=substr(NAME,POS,LEN);
      output; 
      leave;
    end;
  end;
run;

SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 05-19-2019 08:39 PM

Hello @ChrisNZ,

thanks for your amazing codes. it is very helpful for my work. however, the code does not fit the following strings.

3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD

which finally get.

3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD

3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO

3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD

How can I only process the first string which in the list?

Could you please give me some suggestions about this?

ChrisNZ · Posted 05-19-2019 10:47 PM

If you swap the positions of LTD and CO in

 array SUFFIX[22] $6 _temporary_( 
                        'LTD   '
                        'CO    '

it will work for these companies.

See:

SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO

This code was given as a template, not a solution.

You have to modify it to match your data and your needs.

High-Performance SAS Coding - Third Edition

split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

Re: split strings based on a list of specific strings and stop the process when first meeting them

SAS Innovate 2025: Call for Content

Classroom Training Available!