BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

I am trying to separate the company name from address/introduction information. In order to get the most accurate result, I expect to separate the NAME variable based on company suffix (which in this example, are 'CO', 'LTD', 'INC', 'CO LTD', 'CO LP', 'CORP'). However, these values include at least one company suffix. So I expect to 1. find the first company suffix that appears, and then 2. find the length of company suffix. 3. if two types of company suffixes have same POSITION number, I expect to use the company suffix which has longer LENGTH. 4.sepate the NAME variable based on that company suffix.

 

for example, for value,

AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE

there are three kinds of company suffix, which are 'CO', 'CO LP' and ‘LTD’,

I expect to use CALL PRXSUBSTR to get the position and length of these three company suffixes, and then 

1.exclude ‘LTD’ because the position of 'LTD' is behind the position of 'CO' and 'CO LP'. and then

3.exclude 'CO' because the length of 'CO' is shorter than the length of 'CO LP'

4. separate the NAME variable based on 'CO LP', namely, get 'AMANA CO LP' as a result.

 

However, I do not know how to use the code most efficiently. Especially how to get the company suffix I need, and the position and length of it. Could you please give me some suggestion about this?

 

 

data HAVE;
input NAME :& $800.;
call prxsubstr(' CO ',NAME,POS_CO,LEN_CO);
call prxsubstr(' LTD ',NAME,POS_LTD,LEN_LTD);
call prxsubstr(' INC ',NAME,POS_INC,LEN_INC);
call prxsubstr(' CO LTD ',NAME,POS_CO_LTD,LEN_CO_LTD);
call prxsubstr(' CO LP ',NAME,POS_CO_LP,LEN_CO_LP);
call prxsubstr(' CORP ',NAME,POS_CORP,LEN_CORP);
/* find the target company suffix and its length */
if find(NAME,' CO ')>0 then do;
if prxmatch('/(.*) CO\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO ')+2);
NAME_address=strip(substr(NAME,find(NAME,' CO ')+4, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO ')+2);
NAME_address=strip(substr(NAME,find(NAME,' CO ')+4, length(NAME)));
end;
end;
if find(NAME,' LTD ')>0 then do;
if prxmatch('/(.*) LTD\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' LTD ')+3);
NAME_address=strip(substr(NAME,find(NAME,' LTD ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' LTD ')+3);
NAME_address=strip(substr(NAME,find(NAME,' LTD ')+5, length(NAME)));
end;
end;
if find(NAME,' INC ')>0 then do;
if prxmatch('/(.*) INC\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' INC ')+3);
NAME_address=strip(substr(NAME,find(NAME,' INC ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' INC ')+3);
NAME_address=strip(substr(NAME,find(NAME,' INC ')+5, length(NAME)));
end;
end;
if find(NAME,' CO LTD ')>0 then do;
if prxmatch('/(.*) CO LTD\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO LTD ')+6);
NAME_address=strip(substr(NAME,find(NAME,' CO LTD ')+8, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO LTD ')+6);
NAME_address=strip(substr(NAME,find(NAME,' CO LTD ')+8, length(NAME)));
end;
end;
if find(NAME,' CO LP ')>0 then do;
if prxmatch('/(.*) CO LP\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO LP ')+5);
NAME_address=strip(substr(NAME,find(NAME,' CO LP ')+7, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO LP ')+5);
NAME_address=strip(substr(NAME,find(NAME,' CO LP ')+7, length(NAME)));
end;
end;
if find(NAME,' CORP ')>0 then do;
if prxmatch('/(.*) CORP\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CORP ')+3);
NAME_address=strip(substr(NAME,find(NAME,' CORP ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CORP ')+3);
NAME_address=strip(substr(NAME,find(NAME,' CORP ')+5, length(NAME)));
end;
end;

cards;
3M INNOVATIVE PROPERTIES CO NIPPON TELEGRAPH & TELEPHONE CORP
A. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
FEDERAL-HOFFMAN INC D.B.A. FEDERAL CARTRIDGE CO
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) A LTD IN UK
run;

as a result, I expect to get 

NAMENAME_BNAME_address
3M INNOVATIVE PROPERTIES CO NIPPON TELEGRAPH & TELEPHONE CORP3M INNOVATIVE PROPERTIES CONIPPON TELEGRAPH & TELEPHONE CORP
A. STUCKI CO A DELAWARE CORPA. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWAREAMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
FEDERAL-HOFFMAN INC D.B.A. FEDERAL CARTRIDGE COFEDERAL-HOFFMAN INCD.B.A. FEDERAL CARTRIDGE CO
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) 
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) A LTD IN UK3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)A LTD IN UK
1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

> the ‘smallest’ is the position of the first company suffix

 

Ok so we are on the same wave length.

 

1. Use index() instead of 

patternID = prxparse('/ CO /');

it's much faster.

 

2. You need to keep track of which string got you the smallest position and use it.

Basic IF THEN tests on the POS variables is the way to do it.

if POS_CO = SMALLEST then do;

..

else if etc

View solution in original post

3 REPLIES 3
ChrisNZ
Tourmaline | Level 20

You have to test for the 

if find(NAME,' CO LTD ') then do;

block before the 

if find(NAME,' CO ') then do;

block.

Alexxxxxxx
Pyrite | Level 9

Hello @ChrisNZ 

 

Thanks for your suggestion.

 

However, it is a large and complicated data, for example, the value like, 

 

BPB CO A UK CO LTD
BROADCOM UK CO LTD A DELAWARE CO<code></code>

 

 So, I am trying to check them one by one.

At the current stage, I can find the first company suffix by the following code,

 

data HAVE;
input NAME :& $800.;
cards;
A. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
run; 


data want;
set HAVE;
patternID = prxparse('/ CO /');
call prxsubstr(patternID,NAME,POS_CO,LEN_CO);
patternID = prxparse('/ LTD /');
call prxsubstr(patternID,NAME,POS_LTD,LEN_LTD);
patternID = prxparse('/ CO LP /');
call prxsubstr(patternID,NAME,POS_CO_LP,LEN_CO_LP);
patternID = prxparse('/ CORP /');
call prxsubstr(patternID,NAME,POS_CORP,LEN_CORP);

array x[*] POS_: ;
do k=1 to patternID;
smallest = smallest(k, of x[*]);
if smallest >0 then output;
end;
drop k patternID;

run;


proc sql;
create table want1 as
select distinct 
*
from want
group by NAME
having smallest=min(smallest)
;
quit;

 

the ‘smallest’ is the position of the first company suffix, however, the position will be the same for 'CO' and 'CO LP' in value 'AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE'

 

 

So I need to 1.select the variable which has the longer length (in this example, it is 'CO LP'). and know which suffix is selected.

 

by using the result of the code, Table 2, as an example 

NAMEPOS_COLEN_COPOS_LTDLEN_LTDPOS_CO_LPLEN_CO_LPPOS_CORPLEN_CORPsmallest
A. STUCKI CO A DELAWARE CORP104000024610
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE6414567006

 

When POS_CO and POS_CO_LP both equal smallest (which is 6), I expect to select CO LP because the number of LEN_CO (which is 4) is larger than the number of LEN_CO_LP(which is 7). 

 

Finally, I expect to get a result,

NAMEPOS_COLEN_COPOS_LTDLEN_LTDPOS_CO_LPLEN_CO_LPPOS_CORPLEN_CORPsmallestlargestsuffix
A. STUCKI CO A DELAWARE CORP1040000246104CO
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE64145670067CO LP

 

 

Could you please give me some suggestion about this?

thanks in advance.

ChrisNZ
Tourmaline | Level 20

> the ‘smallest’ is the position of the first company suffix

 

Ok so we are on the same wave length.

 

1. Use index() instead of 

patternID = prxparse('/ CO /');

it's much faster.

 

2. You need to keep track of which string got you the smallest position and use it.

Basic IF THEN tests on the POS variables is the way to do it.

if POS_CO = SMALLEST then do;

..

else if etc

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 444 views
  • 0 likes
  • 2 in conversation