split several 'company name' variable in new variables based on brackets, quotation marks, etc.

Alexxxxxxx — Sun, 31 Mar 2019 00:25:16 GMT

Dear all,

I have 'company names' dataset. For each company, I have four different company names (i.g., NAME1, NAME2, NAME_SHORT, Previous_name) and expect to process each of them. for each 'company name' variable, I expect to find all strings between (),[],and {} (such as <BR>, [FONT],{BODY},'A',"JUICE") and split them in a new variable (i.e., &COMPANY_NAME._inB, and &COMPANY_NAME._noB).

for example

table a

no	NAME1	NAME2	NAME_SHORT	Previous_name
1	JUICE<BR>apple[footer]	HARDY(FRNS.)'A'	HARDY	HARDY
2	juice <BR> apple	juice (BR) apple	juice (BR) apple	juice (BR) apple
3	juice<BODY> 'apple'	juice(BODY)	juice(BODY)	juice(BODY)
4	<figure> "juice" LTD		"juice" LTD	"juice" LTD

by using 'NAME1' and 'NAME2' as an example, I expect to get

	NAME1	NAME1_inB	NAME1_noB	NAME2	NAME2_inB	NAME2_noB
1	JUICE<BR>apple[footer]	BR	JUICE apple	HARDY(FRNS.)'A'	FRNS.	HARDY
1	JUICE<BR>apple[footer]	footer	JUICE apple	HARDY(FRNS.)'A'	A	HARDY
2	juice <BR> apple	BR	juice apple	juice (BR) apple	BR	juice apple
3	juice<BODY> 'apple'	BODY	juice	juice(BODY)	BODY	juice
3	juice<BODY> 'apple'	apple	juice	juice(BODY)	BODY	juice
4	<figure> "juice" LTD	figure	LTD
4	<figure> "juice" LTD	juice	LTD

data have;
input NAME1 $100. NAME2 $100. NAME_SHORT $100. Previous_name $100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
;
run;

Edit:

at the current stage, I can find and split the variable by the code,

Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;

%MACRO CompnayNameM(no=,Company_name=);

data step9.Patstat_gb_hrm_Step3;
   set step9.Patstat_gb_hrm_Step3;
   RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/');
   start=1;
   stop=length(&COMPANY_NAME._Step23);
   call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      do while (pos > 0);
         &COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
         &COMPANY_NAME._noB = prxchange('s/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/ /', -1, &COMPANY_NAME._Step23);
         call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      end;
	  drop RegExID pos length start stop;
run;

proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;

%MEND CompnayNameM;

%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;

however, the value like,

3M INNOVATIVE PROPERTIES COMPANY (MINNESOTA MINING  &  MANUFACTURING INNOVATIVE PROPERTIES COMPANY)

3RD ANGLE (U.K.)

ABB (ASEA BROWN BOVERI)

cannot be processed. could you give me some suggestions about this?

Re: split several 'company name' variable in new variables based on brackets, quotation marks, etc.

PeterClemmensen — Sun, 31 Mar 2019 07:48:14 GMT

You can apply the same logic as in your previous thread on each variable and merge like this

data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;

data one;
   format NAME1 NAME1_inB NAME1_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
   NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
      do while (pos > 0);
         NAME1_inB = substr(NAME1, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
      end;
   keep NAME1 NAME1_inB NAME1_noB;
run;

data two;
   format NAME2 NAME2_inB NAME2_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
   NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
      do while (pos > 0);
         NAME2_inB = substr(NAME2, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
      end;
   keep NAME2 NAME2_inB NAME2_noB;
run;

data want;
   merge one two;
run;

topic Re: split several 'company name' variable in new variables based on brackets, quotation marks, etc. in SAS Programming

split several 'company name' variable in new variables based on brackets, quotation marks, etc.

Re: split several 'company name' variable in new variables based on brackets, quotation marks, etc.