Dear all,
I have 'company names' dataset. For each company, I have four different company names (i.g., NAME1, NAME2, NAME_SHORT, Previous_name) and expect to process each of them. for each 'company name' variable, I expect to find all strings between (),[],and {} (such as <BR>, [FONT],{BODY},'A',"JUICE") and split them in a new variable (i.e., &COMPANY_NAME._inB, and &COMPANY_NAME._noB).
for example
table a
no | NAME1 | NAME2 | NAME_SHORT | Previous_name |
1 | JUICE<BR>apple[footer] | HARDY(FRNS.)'A' | HARDY | HARDY |
2 | juice <BR> apple | juice (BR) apple | juice (BR) apple | juice (BR) apple |
3 | juice<BODY> 'apple' | juice(BODY) | juice(BODY) | juice(BODY) |
4 | <figure> "juice" LTD | "juice" LTD | "juice" LTD |
by using 'NAME1' and 'NAME2' as an example, I expect to get
NAME1 | NAME1_inB | NAME1_noB | NAME2 | NAME2_inB | NAME2_noB | |
1 | JUICE<BR>apple[footer] | BR | JUICE apple | HARDY(FRNS.)'A' | FRNS. | HARDY |
1 | JUICE<BR>apple[footer] | footer | JUICE apple | HARDY(FRNS.)'A' | A | HARDY |
2 | juice <BR> apple | BR | juice apple | juice (BR) apple | BR | juice apple |
3 | juice<BODY> 'apple' | BODY | juice | juice(BODY) | BODY | juice |
3 | juice<BODY> 'apple' | apple | juice | juice(BODY) | BODY | juice |
4 | <figure> "juice" LTD | figure | LTD | |||
4 | <figure> "juice" LTD | juice | LTD |
data have;
input NAME1 $100. NAME2 $100. NAME_SHORT $100. Previous_name $100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer], HARDY(FRNS.)'A', HARDY, HARDY
juice <BR> apple, juice (BR) apple, juice (BR) apple, juice (BR) apple
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
;
run;
Edit:
at the current stage, I can find and split the variable by the code,
Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;
%MACRO CompnayNameM(no=,Company_name=);
data step9.Patstat_gb_hrm_Step3;
set step9.Patstat_gb_hrm_Step3;
RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/');
start=1;
stop=length(&COMPANY_NAME._Step23);
call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
do while (pos > 0);
&COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
&COMPANY_NAME._noB = prxchange('s/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/ /', -1, &COMPANY_NAME._Step23);
call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
end;
drop RegExID pos length start stop;
run;
proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;
%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;
however, the value like,
3M INNOVATIVE PROPERTIES COMPANY (MINNESOTA MINING & MANUFACTURING INNOVATIVE PROPERTIES COMPANY)
or
3RD ANGLE (U.K.)
or
ABB (ASEA BROWN BOVERI)
cannot be processed. could you give me some suggestions about this?
You can apply the same logic as in your previous thread on each variable and merge like this
data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer], HARDY(FRNS.)'A', HARDY, HARDY
juice <BR> apple, juice (BR) apple, juice (BR) apple, juice (BR) apple
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;
data one;
format NAME1 NAME1_inB NAME1_noB;
set have;
RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
start=1;
call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
do while (pos > 0);
NAME1_inB = substr(NAME1, pos+1, length-2);
output;
call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
end;
keep NAME1 NAME1_inB NAME1_noB;
run;
data two;
format NAME2 NAME2_inB NAME2_noB;
set have;
RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
start=1;
call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
do while (pos > 0);
NAME2_inB = substr(NAME2, pos+1, length-2);
output;
call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
end;
keep NAME2 NAME2_inB NAME2_noB;
run;
data want;
merge one two;
run;
You can apply the same logic as in your previous thread on each variable and merge like this
data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer], HARDY(FRNS.)'A', HARDY, HARDY
juice <BR> apple, juice (BR) apple, juice (BR) apple, juice (BR) apple
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;
data one;
format NAME1 NAME1_inB NAME1_noB;
set have;
RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
start=1;
call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
do while (pos > 0);
NAME1_inB = substr(NAME1, pos+1, length-2);
output;
call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
end;
keep NAME1 NAME1_inB NAME1_noB;
run;
data two;
format NAME2 NAME2_inB NAME2_noB;
set have;
RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
start=1;
call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
do while (pos > 0);
NAME2_inB = substr(NAME2, pos+1, length-2);
output;
call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
end;
keep NAME2 NAME2_inB NAME2_noB;
run;
data want;
merge one two;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.