Dear all,
I am trying to separate the company name from address/introduction information. In order to get the most accurate result, I expect to separate the NAME variable based on company suffix (which in this example, are 'CO', 'LTD', 'INC', 'CO LTD', 'CO LP', 'CORP'). However, these values include at least one company suffix. So I expect to 1. find the first company suffix that appears, and then 2. find the length of company suffix. 3. if two types of company suffixes have same POSITION number, I expect to use the company suffix which has longer LENGTH. 4.sepate the NAME variable based on that company suffix.
for example, for value,
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE |
there are three kinds of company suffix, which are 'CO', 'CO LP' and ‘LTD’,
I expect to use CALL PRXSUBSTR to get the position and length of these three company suffixes, and then
1.exclude ‘LTD’ because the position of 'LTD' is behind the position of 'CO' and 'CO LP'. and then
3.exclude 'CO' because the length of 'CO' is shorter than the length of 'CO LP'
4. separate the NAME variable based on 'CO LP', namely, get 'AMANA CO LP' as a result.
However, I do not know how to use the code most efficiently. Especially how to get the company suffix I need, and the position and length of it. Could you please give me some suggestion about this?
data HAVE;
input NAME :& $800.;
call prxsubstr(' CO ',NAME,POS_CO,LEN_CO);
call prxsubstr(' LTD ',NAME,POS_LTD,LEN_LTD);
call prxsubstr(' INC ',NAME,POS_INC,LEN_INC);
call prxsubstr(' CO LTD ',NAME,POS_CO_LTD,LEN_CO_LTD);
call prxsubstr(' CO LP ',NAME,POS_CO_LP,LEN_CO_LP);
call prxsubstr(' CORP ',NAME,POS_CORP,LEN_CORP);
/* find the target company suffix and its length */
if find(NAME,' CO ')>0 then do;
if prxmatch('/(.*) CO\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO ')+2);
NAME_address=strip(substr(NAME,find(NAME,' CO ')+4, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO ')+2);
NAME_address=strip(substr(NAME,find(NAME,' CO ')+4, length(NAME)));
end;
end;
if find(NAME,' LTD ')>0 then do;
if prxmatch('/(.*) LTD\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' LTD ')+3);
NAME_address=strip(substr(NAME,find(NAME,' LTD ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' LTD ')+3);
NAME_address=strip(substr(NAME,find(NAME,' LTD ')+5, length(NAME)));
end;
end;
if find(NAME,' INC ')>0 then do;
if prxmatch('/(.*) INC\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' INC ')+3);
NAME_address=strip(substr(NAME,find(NAME,' INC ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' INC ')+3);
NAME_address=strip(substr(NAME,find(NAME,' INC ')+5, length(NAME)));
end;
end;
if find(NAME,' CO LTD ')>0 then do;
if prxmatch('/(.*) CO LTD\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO LTD ')+6);
NAME_address=strip(substr(NAME,find(NAME,' CO LTD ')+8, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO LTD ')+6);
NAME_address=strip(substr(NAME,find(NAME,' CO LTD ')+8, length(NAME)));
end;
end;
if find(NAME,' CO LP ')>0 then do;
if prxmatch('/(.*) CO LP\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CO LP ')+5);
NAME_address=strip(substr(NAME,find(NAME,' CO LP ')+7, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CO LP ')+5);
NAME_address=strip(substr(NAME,find(NAME,' CO LP ')+7, length(NAME)));
end;
end;
if find(NAME,' CORP ')>0 then do;
if prxmatch('/(.*) CORP\s?([(\(.*\))|(\[.*\])|(\{.*\})|(''.*'')|(".*")]+)(.*)/',NAME )=0
and (countc(NAME, '()') ne 1 or countc(NAME, '[]') ne 1
or countc(NAME, '{}') ne 1 or countc(NAME, '"') ne 1 or countc(NAME, "'") ne 1) then do;
NAME_B=substr(NAME,1,find(NAME,' CORP ')+3);
NAME_address=strip(substr(NAME,find(NAME,' CORP ')+5, length(NAME)));
end;
else do;
NAME_B=substr(NAME,1,find(NAME,' CORP ')+3);
NAME_address=strip(substr(NAME,find(NAME,' CORP ')+5, length(NAME)));
end;
end;
cards;
3M INNOVATIVE PROPERTIES CO NIPPON TELEGRAPH & TELEPHONE CORP
A. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
FEDERAL-HOFFMAN INC D.B.A. FEDERAL CARTRIDGE CO
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY)
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) A LTD IN UK
run;
as a result, I expect to get
NAME | NAME_B | NAME_address |
3M INNOVATIVE PROPERTIES CO NIPPON TELEGRAPH & TELEPHONE CORP | 3M INNOVATIVE PROPERTIES CO | NIPPON TELEGRAPH & TELEPHONE CORP |
A. STUCKI CO A DELAWARE CORP | A. STUCKI CO | A DELAWARE CORP |
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE | AMANA CO LP | A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE |
FEDERAL-HOFFMAN INC D.B.A. FEDERAL CARTRIDGE CO | FEDERAL-HOFFMAN INC | D.B.A. FEDERAL CARTRIDGE CO |
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) | 3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) | |
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY | 3M CO | (MINNESOTA MINING AND MANUFACTURING COMPANY |
3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) A LTD IN UK | 3M CO (MINNESOTA MINING AND MANUFACTURING COMPANY) | A LTD IN UK |
> the ‘smallest’ is the position of the first company suffix
Ok so we are on the same wave length.
1. Use index() instead of
patternID = prxparse('/ CO /');
it's much faster.
2. You need to keep track of which string got you the smallest position and use it.
Basic IF THEN tests on the POS variables is the way to do it.
if POS_CO = SMALLEST then do;
..
else if etc
You have to test for the
if find(NAME,' CO LTD ') then do;
block before the
if find(NAME,' CO ') then do;
block.
Hello @ChrisNZ
Thanks for your suggestion.
However, it is a large and complicated data, for example, the value like,
BPB CO A UK CO LTD
BROADCOM UK CO LTD A DELAWARE CO<code></code>
So, I am trying to check them one by one.
At the current stage, I can find the first company suffix by the following code,
data HAVE;
input NAME :& $800.;
cards;
A. STUCKI CO A DELAWARE CORP
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE
run;
data want;
set HAVE;
patternID = prxparse('/ CO /');
call prxsubstr(patternID,NAME,POS_CO,LEN_CO);
patternID = prxparse('/ LTD /');
call prxsubstr(patternID,NAME,POS_LTD,LEN_LTD);
patternID = prxparse('/ CO LP /');
call prxsubstr(patternID,NAME,POS_CO_LP,LEN_CO_LP);
patternID = prxparse('/ CORP /');
call prxsubstr(patternID,NAME,POS_CORP,LEN_CORP);
array x[*] POS_: ;
do k=1 to patternID;
smallest = smallest(k, of x[*]);
if smallest >0 then output;
end;
drop k patternID;
run;
proc sql;
create table want1 as
select distinct
*
from want
group by NAME
having smallest=min(smallest)
;
quit;
the ‘smallest’ is the position of the first company suffix, however, the position will be the same for 'CO' and 'CO LP' in value 'AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE'
So I need to 1.select the variable which has the longer length (in this example, it is 'CO LP'). and know which suffix is selected.
by using the result of the code, Table 2, as an example
NAME | POS_CO | LEN_CO | POS_LTD | LEN_LTD | POS_CO_LP | LEN_CO_LP | POS_CORP | LEN_CORP | smallest |
A. STUCKI CO A DELAWARE CORP | 10 | 4 | 0 | 0 | 0 | 0 | 24 | 6 | 10 |
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE | 6 | 4 | 14 | 5 | 6 | 7 | 0 | 0 | 6 |
When POS_CO and POS_CO_LP both equal smallest (which is 6), I expect to select CO LP because the number of LEN_CO (which is 4) is larger than the number of LEN_CO_LP(which is 7).
Finally, I expect to get a result,
NAME | POS_CO | LEN_CO | POS_LTD | LEN_LTD | POS_CO_LP | LEN_CO_LP | POS_CORP | LEN_CORP | smallest | largest | suffix |
A. STUCKI CO A DELAWARE CORP | 10 | 4 | 0 | 0 | 0 | 0 | 24 | 6 | 10 | 4 | CO |
AMANA CO LP A LTD PARTNERSHIP UNDER THE LAWS OF USA-DELAWARE | 6 | 4 | 14 | 5 | 6 | 7 | 0 | 0 | 6 | 7 | CO LP |
Could you please give me some suggestion about this?
thanks in advance.
> the ‘smallest’ is the position of the first company suffix
Ok so we are on the same wave length.
1. Use index() instead of
patternID = prxparse('/ CO /');
it's much faster.
2. You need to keep track of which string got you the smallest position and use it.
Basic IF THEN tests on the POS variables is the way to do it.
if POS_CO = SMALLEST then do;
..
else if etc
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.