BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

I have 'company names' dataset. For each company, I have four different company names (i.g., NAME1, NAME2, NAME_SHORT, Previous_name) and expect to process each of them. for each 'company name' variable, I expect to find all strings between (),[],and {} (such as <BR>, [FONT],{BODY},'A',"JUICE") and split them in a new variable (i.e., &COMPANY_NAME._inB, and &COMPANY_NAME._noB).

 

for example 

table a 

noNAME1NAME2NAME_SHORTPrevious_name
1JUICE<BR>apple[footer] HARDY(FRNS.)'A'HARDYHARDY
2juice <BR> apple juice (BR) apple juice (BR) apple juice (BR) apple 
3juice<BODY> 'apple'juice(BODY)juice(BODY)juice(BODY)
4<figure> "juice" LTD "juice" LTD"juice" LTD

by using 'NAME1' and 'NAME2' as an example, I expect to get 

 NAME1NAME1_inBNAME1_noBNAME2NAME2_inBNAME2_noB
1JUICE<BR>apple[footer] BRJUICE apple HARDY(FRNS.)'A'FRNS.HARDY
1JUICE<BR>apple[footer] footerJUICE apple HARDY(FRNS.)'A'AHARDY
2juice <BR> apple BRjuice  apple juice (BR) apple BRjuice apple 
3juice<BODY> 'apple'BODYjuicejuice(BODY)BODYjuice
3juice<BODY> 'apple'applejuicejuice(BODY)BODYjuice
4<figure> "juice" LTDfigureLTD   
4<figure> "juice" LTDjuiceLTD   

 

 

data have;
input NAME1 $100. NAME2 $100. NAME_SHORT $100. Previous_name $100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
;
run;

 

Edit:

at the current stage, I can find and split the variable by the code,

Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;

%MACRO CompnayNameM(no=,Company_name=);

data step9.Patstat_gb_hrm_Step3;
   set step9.Patstat_gb_hrm_Step3;
   RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/');
   start=1;
   stop=length(&COMPANY_NAME._Step23);
   call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      do while (pos > 0);
         &COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
         &COMPANY_NAME._noB = prxchange('s/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/ /', -1, &COMPANY_NAME._Step23);
         call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      end;
	  drop RegExID pos length start stop;
run;

proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;

%MEND CompnayNameM;

%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;

however, the value like,

3M INNOVATIVE PROPERTIES COMPANY (MINNESOTA MINING  &  MANUFACTURING INNOVATIVE PROPERTIES COMPANY)

or 

3RD ANGLE (U.K.)

or 

ABB (ASEA BROWN BOVERI)

cannot be processed. could you give me some suggestions about this?

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

You can apply the same logic as in your previous thread on each variable and merge like this

 

data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;

data one;
   format NAME1 NAME1_inB NAME1_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
   NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
      do while (pos > 0);
         NAME1_inB = substr(NAME1, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
      end;
   keep NAME1 NAME1_inB NAME1_noB;
run;

data two;
   format NAME2 NAME2_inB NAME2_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
   NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
      do while (pos > 0);
         NAME2_inB = substr(NAME2, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
      end;
   keep NAME2 NAME2_inB NAME2_noB;
run;

data want;
   merge one two;
run;

View solution in original post

1 REPLY 1
PeterClemmensen
Tourmaline | Level 20

You can apply the same logic as in your previous thread on each variable and merge like this

 

data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;

data one;
   format NAME1 NAME1_inB NAME1_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
   NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
      do while (pos > 0);
         NAME1_inB = substr(NAME1, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
      end;
   keep NAME1 NAME1_inB NAME1_noB;
run;

data two;
   format NAME2 NAME2_inB NAME2_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
   NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
      do while (pos > 0);
         NAME2_inB = substr(NAME2, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
      end;
   keep NAME2 NAME2_inB NAME2_noB;
run;

data want;
   merge one two;
run;

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 1 reply
  • 768 views
  • 1 like
  • 2 in conversation