Solved: Re: How to find the target string based on position and length?

Alexxxxxxx · Posted 06-04-2019 04:01 AM

Dear all,

I am trying to separate the company name from address/introduction information. In order to get the most accurate result, I expect to separate the NAME variable based on company suffix (which in this example, are 'CO', 'LTD', 'INC', 'CO LTD', 'CO LP', 'CORP'). However, these values include at least one company suffix. So I expect to 1. find the first company suffix that appears, and then 2. find the length of company suffix. 3. if two types of company suffixes have same POSITION number, I expect to use the company suffix which has longer LENGTH. 4.sepate the NAME variable based on that company suffix.

for example, for value,

AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE

there are three kinds of company suffix, which are 'CO', 'CO LP' and ‘LTD’,

I expect to use CALL PRXSUBSTR to get the position and length of these three company suffixes, and then

1.exclude ‘LTD’ because the position of 'LTD' is behind the position of 'CO' and 'CO LP'. and then

3.exclude 'CO' because the length of 'CO' is shorter than the length of 'CO LP'

4. separate the NAME variable based on 'CO LP', namely, get 'AMANA CO LP' as a result.

However, I do not know how to use the code most efficiently. Especially how to get the company suffix I need, and the position and length of it. Could you please give me some suggestion about this?

data HAVE;
input NAME :& $800.;
call prxsubstr(' CO ',NAME,POS_CO,LEN_CO);
call prxsubstr(' LTD ',NAME,POS_LTD,LEN_LTD);
call prxsubstr(' INC ',NAME,POS_INC,LEN_INC);
call prxsubstr(' CO LTD ',NAME,POS_CO_LTD,LEN_CO_LTD);
call prxsubstr(' CO LP ',NAME,POS_CO_LP,LEN_CO_LP);
call prxsubstr(' CORP ',NAME,POS_CORP,LEN_CORP);
 /* find the target company suffix and its length */
if find(NAME,' CO ')>0 then do;
if prxmatch('/(.*) CO\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO ')+2);
NAME_address=strip(substr(NAME,find(NAME,' CO ')+4, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO ')+2);
NAME_address=strip(substr(NAME,find(NAME,' CO ')+4, length(NAME)));
end;
end;
if find(NAME,' LTD ')>0 then do;
if prxmatch('/(.*) LTD\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' LTD ')+3);
NAME_address=strip(substr(NAME,find(NAME,' LTD ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' LTD ')+3);
NAME_address=strip(substr(NAME,find(NAME,' LTD ')+5, length(NAME)));
end;
end;
if find(NAME,' INC ')>0 then do;
if prxmatch('/(.*) INC\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' INC ')+3);
NAME_address=strip(substr(NAME,find(NAME,' INC ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' INC ')+3);
NAME_address=strip(substr(NAME,find(NAME,' INC ')+5, length(NAME)));
end;
end;
if find(NAME,' CO LTD ')>0 then do;
if prxmatch('/(.*) CO LTD\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO LTD ')+6);
NAME_address=strip(substr(NAME,find(NAME,' CO LTD ')+8, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO LTD ')+6);
NAME_address=strip(substr(NAME,find(NAME,' CO LTD ')+8, length(NAME)));
end;
end; 
if find(NAME,' CO LP ')>0 then do;
if prxmatch('/(.*) CO LP\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO LP ')+5);
NAME_address=strip(substr(NAME,find(NAME,' CO LP ')+7, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO LP ')+5);
NAME_address=strip(substr(NAME,find(NAME,' CO LP ')+7, length(NAME)));
end;
end;
if find(NAME,' CORP ')>0 then do;
if prxmatch('/(.*) CORP\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CORP ')+3);
NAME_address=strip(substr(NAME,find(NAME,' CORP ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CORP ')+3);
NAME_address=strip(substr(NAME,find(NAME,' CORP ')+5, length(NAME)));
end;
end;

cards;
3M INNOVATIVE PROPERTIES CO NIPPON TELEGRAPH & TELEPHONE CORP 
A. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
FEDERAL-HOFFMAN INC D.B.A. FEDERAL CARTRIDGE CO
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) 
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY 
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) A LTD IN UK 
run;

as a result, I expect to get

NAME	NAME_B	NAME_address
3M INNOVATIVE PROPERTIES CO NIPPON TELEGRAPH & TELEPHONE CORP	3M INNOVATIVE PROPERTIES CO	NIPPON TELEGRAPH & TELEPHONE CORP
A. STUCKI CO A DELAWARE CORP	A. STUCKI CO	A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE	AMANA CO LP	A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
FEDERAL-HOFFMAN INC D.B.A. FEDERAL CARTRIDGE CO	FEDERAL-HOFFMAN INC	D.B.A. FEDERAL CARTRIDGE CO
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)	3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY	3M CO	(MINNESOTA MINING AND MANUFACTURING COMPANY
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) A LTD IN UK	3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)	A LTD IN UK

ChrisNZ · Posted 06-05-2019 01:00 AM

> the ‘smallest’ is the position of the first company suffix

Ok so we are on the same wave length.

1. Use index() instead of

patternID = prxparse('/ CO /');

it's much faster.

2. You need to keep track of which string got you the smallest position and use it.

Basic IF THEN tests on the POS variables is the way to do it.

if POS_CO = SMALLEST then do;

..

else if etc

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 06-04-2019 11:19 PM

You have to test for the

if find(NAME,' CO LTD ') then do;

block before the

if find(NAME,' CO ') then do;

block.

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 06-05-2019 12:24 AM

Hello @ChrisNZ

Thanks for your suggestion.

However, it is a large and complicated data, for example, the value like,

BPB CO A UK CO LTD
BROADCOM UK CO LTD A DELAWARE CO<code></code>

So, I am trying to check them one by one.

At the current stage, I can find the first company suffix by the following code,

data HAVE;
input NAME :& $800.;
cards;
A. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
run; 


data want;
set HAVE;
patternID = prxparse('/ CO /');
call prxsubstr(patternID,NAME,POS_CO,LEN_CO);
patternID = prxparse('/ LTD /');
call prxsubstr(patternID,NAME,POS_LTD,LEN_LTD);
patternID = prxparse('/ CO LP /');
call prxsubstr(patternID,NAME,POS_CO_LP,LEN_CO_LP);
patternID = prxparse('/ CORP /');
call prxsubstr(patternID,NAME,POS_CORP,LEN_CORP);

array x[*] POS_: ;
do k=1 to patternID;
smallest = smallest(k, of x[*]);
if smallest >0 then output;
end;
drop k patternID;

run;


proc sql;
create table want1 as
select distinct 
*
from want
group by NAME
having smallest=min(smallest)
;
quit;

the ‘smallest’ is the position of the first company suffix, however, the position will be the same for 'CO' and 'CO LP' in value 'AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE'

So I need to 1.select the variable which has the longer length (in this example, it is 'CO LP'). and know which suffix is selected.

by using the result of the code, Table 2, as an example

NAME	POS_CO	LEN_CO	POS_LTD	LEN_LTD	POS_CO_LP	LEN_CO_LP	POS_CORP	LEN_CORP	smallest
A. STUCKI CO A DELAWARE CORP	10	4	0	0	0	0	24	6	10
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE	6	4	14	5	6	7	0	0	6

When POS_CO and POS_CO_LP both equal smallest (which is 6), I expect to select CO LP because the number of LEN_CO (which is 4) is larger than the number of LEN_CO_LP(which is 7).

Finally, I expect to get a result,

NAME	POS_CO	LEN_CO	POS_LTD	LEN_LTD	POS_CO_LP	LEN_CO_LP	POS_CORP	LEN_CORP	smallest	largest	suffix
A. STUCKI CO A DELAWARE CORP	10	4	0	0	0	0	24	6	10	4	CO
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE	6	4	14	5	6	7	0	0	6	7	CO LP

Could you please give me some suggestion about this?

thanks in advance.

ChrisNZ · Posted 06-05-2019 01:00 AM

> the ‘smallest’ is the position of the first company suffix

Ok so we are on the same wave length.

1. Use index() instead of

patternID = prxparse('/ CO /');

it's much faster.

2. You need to keep track of which string got you the smallest position and use it.

Basic IF THEN tests on the POS variables is the way to do it.

if POS_CO = SMALLEST then do;

..

else if etc

High-Performance SAS Coding - Third Edition

How to find the target string based on position and length?

Re: How to find the target string based on position and length?

Re: How to find the target string based on position and length?

Re: How to find the target string based on position and length?

Re: How to find the target string based on position and length?

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!