BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

How can I split strings based on a list of specific strings (i.e., LTD, CO, LTD CO) and stop the process when first meeting them.

during the process,

1.I expect not to process brackets( i.e., (), [], {}, ' ', " ") and strings among brackets during the process.

2.split the substrings to a new variable 'NAME_address'

 

for example,

 

data HAVE;
  input NAME: &$500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" 


3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM run; 

I expect to get the variable 'NAME_B' and 'NAME_address' from variable 'NAME'

NAMENAME_BNAME_address
3M CO (MINNESOTA MINING AND MANUFACTURING CO)3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO"3M CO "MINNESOTA MINING AND MANUFACTURING CO" 
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK3M CO (MINNESOTA MINING AND MANUFACTURING CO)A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO3M CO 'MINNESOTA MINING AND MANUFACTURING CO'A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD3M CO "MINNESOTA MINING AND MANUFACTURING CO"A UK LTD
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOMARCH TIMBER PROTECTION LTDA PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

 

 

Could you please give me some suggestions about this? thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

This should get you ahead.

data HAVE;
  input NAME :& $500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" 
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD 
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM 
run;    

data WANT;
  set HAVE;
  array SUFFIX[22] $6 _temporary_( 
                        'LTD   '
                        'CO    '  'CO LTD'
                        'INC   '
                        'PLC   '
                        'SPA   '
                        'LLP   '
                        'LLC   '
                        'LC    '
                        'AB    '
                        'AG    '
                        'SA    '
                        'SAS   '
                        'SAL   '
                        'SPA   '
                        'SRO   '
                        'SRL   '
                        'GMBH  '
                        'KK    '
                        'BV    '
                        'CP    '
                        'NL    '
                        ) ;    
  do I=1 to 26;
    STR= '/(.*? '||trim(SUFFIX[I])||'( \(.*\)| ".*"| ''.*'')? )/';
    REGEX= prxparse(STR);
    call prxsubstr(REGEX,NAME,POS,LEN);
    if POS then do;
      SUBSTR1=substr(NAME,POS,LEN);
      output; 
      leave;
    end;
  end;
run;   
SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD 

 

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

@Alexxxxxxx 

Before you even attempt to code for this you need to formulate the rules (exhaustive).

You show us what you want but I wouldn't know how to come up with a set of rules which also covers your last case not only for exactly your sample data.

Alexxxxxxx
Pyrite | Level 9

Dear @Patrick 

 

thanks for your reply. I expect to use the code to clean the company name.

 

so, I would like to separate 'NAME' variable into two new variables, the first is 'NAME_B' variable which only includes strings relevant to the company name, the second variable is 'NAME_address' variable which only includes strings relevant to address.

 

by using the following company name in table 1 as an example 

 

table1
NAME
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM

 

for value 3M CO (MINNESOTA MINING AND MANUFACTURING CO), '3M CO' is the abbreviation of 'MINNESOTA MINING AND MANUFACTURING CO', and I expect to put them together into 'NAME_B'.

 

for value 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO,  " 3M CO 'MINNESOTA MINING AND MANUFACTURING CO' " should belong to 'NAME_B' and 'A UK CO' is an introduction of 3M company, which I expect to split into a new variable named 'NAME_address'.

 

I expect to separate the 'NAME' variable based on 'CO', 'LTD' and 'LTD CO', as they are usually the ending value of a company name.

 

However, for value 'ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM',

'ARCH TIMBER PROTECTION LTD' is the name of the company, and 'A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM' is the introduction of the company, so I expect to split them into two different variables, which are 'NAME_B', 'NAME_address'. But they both include 'LTD' and 'LTD CO', so I expect to stop the process after getting the first company suffix ( in this example, it is 'LTD').

 

that is what I expect to do. Could you please give me some suggestions about this, please let me know if you have any question.

 

thanks in advance.

Patrick
Opal | Level 21

@Alexxxxxxx 

You showed us first how the desired result should look like. Now you explain us the reasons why you want the result to look this way. BUT: That still doesn't give us the RULES how to split strings to create such a result.

I assume you're after code which also works for other names and not just exactly for what you've posted.

 

For example for below:

for value 3M CO (MINNESOTA MINING AND MANUFACTURING CO), '3M CO' is the abbreviation of 'MINNESOTA MINING AND MANUFACTURING CO', and I expect to put them together into 'NAME_B'.

What's the rule for a computer to "know" that this is an abbreviation? Do you have somewhere a list of such abbreviations where we could look up?

 

I expect to separate the 'NAME' variable based on 'CO', 'LTD' and 'LTD CO', as they are usually the ending value of a company name.

"USUALLY" might give you the generic rule - but then you need clear rules for all the exceptions. Looking at your sample data it also appears that you don't want to just split before LTD or CO but also want to include some more "leading" words. 

Based on your sample data a rule could be to split after the first CO immediately followed by a non-letter character. But I guess that would only be appropriate for your sample data and we would still need some exception rules for your last case and for "abbreviations".

 

So... Try to formulate rules which are exclusive and exhaustive and post them. Then we can help you with the coding.

 

Alexxxxxxx
Pyrite | Level 9

dear @Patrick 

 

thanks for your reply.

 

yes,  it is my expectation basically.

a rule could be to split after the first CO immediately followed by a non-letter character.

except for

non-letter character.

I expect not followed by a non-letter character, but these five group characters, i.e., (), [], {}, ' ', " ",  and strings among them. such as 

<span class="token string"><span class="token punctuation">(</span>MINNESOTA MINING AND MANUFACTURING CO<span class="token punctuation">)<br />[MINNESOTA MINING AND MANUFACTURING CO]<br /></span><span class="token punctuation">{MINNESOTA MINING AND MANUFACTURING CO}</span><br />'MINNESOTA MINING AND MANUFACTURING CO'</span> <br />

rather than 

(MINNESOTA MINING AND MANUFACTURING CO

or 

(MINNESOTA MINING AND MANUFACTURING CO'

Besides, the list of company suffix except for 'CO' is 

 

data HAVE;
  input NAME: &$500.;
cards;
LTD
CO LTD
INC  
PLC
SPA
LLP
LLC
LC
AB
AG
SA
SAS
SAL
SPA
SRO
SRL
GMBH
KK
BV
CP
NL
;
run;

please let me know if it is not clear.

 

 

Patrick
Opal | Level 21

@Alexxxxxxx 

From my point of view YOU need to inspect your data and then actually fully define the rules you need implemented (=clearly and explicitly formulated "black and white" rules and not "could be").

Create representative sample data (have) - best using your real data - and define the outcome. The first test of your rules is then just going over your sample have data (eventually take a 2nd and different sample from your actual data) and determine if the rules would work to get to the desired outcome (a "desk" test without writing any code). Only when satisfied with the "desk" test actual program design and coding should start.

Sooo... I suggest you try and get to the point where you've got fully formulated and tested rules. 

If you then want coding help provide representative sample data (have) required to implement and unit test these rules via SAS code. 

 

IF you don't define and verify the rules upfront then what's likely going to happen: You will get SAS code from us which does the things you've told us already for the data you've provided; but then when you use this code for your real data there will always be additional cases and you will continue to ask for changes and other stuff. That's why spending the time for proper analysis and rule formulation will get you faster to the desired outcome.

 

Just to give you an example for a rule:

1. First occurrence of word CO <or here reference to list of word> immediately followed by any of the following characters <list of characters>: Split string, assign everything after the list of characters to <2nd variable>

 

Also consider that rules might be hierarchical like: only apply 2nd rule if first rule doesn't apply. You will also have to define this hierarchy.

ChrisNZ
Tourmaline | Level 20

This should get you ahead.

data HAVE;
  input NAME :& $500.;
cards;
3M CO (MINNESOTA MINING AND MANUFACTURING CO) 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" 
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD 
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO ORGANISED UNDER THE LAWS OF THE UNITED KINDGOM 
run;    

data WANT;
  set HAVE;
  array SUFFIX[22] $6 _temporary_( 
                        'LTD   '
                        'CO    '  'CO LTD'
                        'INC   '
                        'PLC   '
                        'SPA   '
                        'LLP   '
                        'LLC   '
                        'LC    '
                        'AB    '
                        'AG    '
                        'SA    '
                        'SAS   '
                        'SAL   '
                        'SPA   '
                        'SRO   '
                        'SRL   '
                        'GMBH  '
                        'KK    '
                        'BV    '
                        'CP    '
                        'NL    '
                        ) ;    
  do I=1 to 26;
    STR= '/(.*? '||trim(SUFFIX[I])||'( \(.*\)| ".*"| ''.*'')? )/';
    REGEX= prxparse(STR);
    call prxsubstr(REGEX,NAME,POS,LEN);
    if POS then do;
      SUBSTR1=substr(NAME,POS,LEN);
      output; 
      leave;
    end;
  end;
run;   
SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD
ARCH TIMBER PROTECTION LTD 

 

Alexxxxxxx
Pyrite | Level 9

Hello @ChrisNZ,

 

thanks for your amazing codes. it is very helpful for my work. however, the code does not fit the following strings.

 

3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD IN UK 
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO 
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD 

which finally get.

3M CO (MINNESOTA MINING AND MANUFACTURING CO) A LTD
3M CO 'MINNESOTA MINING AND MANUFACTURING CO' A UK CO
3M CO "MINNESOTA MINING AND MANUFACTURING CO" A UK LTD

How can I only process the first string which in the list?

Could you please give me some suggestions about this?

 

 

ChrisNZ
Tourmaline | Level 20

If you swap the positions of  LTD and CO in

 array SUFFIX[22] $6 _temporary_( 
                        'LTD   '
                        'CO    ' 

it will work for these companies.

See:

SUBSTR1
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
3M CO "MINNESOTA MINING AND MANUFACTURING CO"
3M CO (MINNESOTA MINING AND MANUFACTURING CO)
3M CO 'MINNESOTA MINING AND MANUFACTURING CO'
ARCH TIMBER PROTECTION LTD A PRIVATE LTD CO

 

This code was given as a template, not a solution.

You have to modify it to match your data and your needs.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1097 views
  • 1 like
  • 3 in conversation