BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

I have 'company names' dataset. For each company, I have four different company names (i.g., NAME1, NAME2, NAME_SHORT, Previous_name) and expect to process each of them. for each 'company name' variable, I expect to find all strings between (),[],and {} (such as <BR>, [FONT],{BODY},'A',"JUICE") and split them in a new variable (i.e., &COMPANY_NAME._inB, and &COMPANY_NAME._noB).

 

for example 

table a 

noNAME1NAME2NAME_SHORTPrevious_name
1JUICE<BR>apple[footer] HARDY(FRNS.)'A'HARDYHARDY
2juice <BR> apple juice (BR) apple juice (BR) apple juice (BR) apple 
3juice<BODY> 'apple'juice(BODY)juice(BODY)juice(BODY)
4<figure> "juice" LTD "juice" LTD"juice" LTD

by using 'NAME1' and 'NAME2' as an example, I expect to get 

 NAME1NAME1_inBNAME1_noBNAME2NAME2_inBNAME2_noB
1JUICE<BR>apple[footer] BRJUICE apple HARDY(FRNS.)'A'FRNS.HARDY
1JUICE<BR>apple[footer] footerJUICE apple HARDY(FRNS.)'A'AHARDY
2juice <BR> apple BRjuice  apple juice (BR) apple BRjuice apple 
3juice<BODY> 'apple'BODYjuicejuice(BODY)BODYjuice
3juice<BODY> 'apple'applejuicejuice(BODY)BODYjuice
4<figure> "juice" LTDfigureLTD   
4<figure> "juice" LTDjuiceLTD   

 

 

data have;
input NAME1 $100. NAME2 $100. NAME_SHORT $100. Previous_name $100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
;
run;

 

Edit:

at the current stage, I can find and split the variable by the code,

Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;

%MACRO CompnayNameM(no=,Company_name=);

data step9.Patstat_gb_hrm_Step3;
   set step9.Patstat_gb_hrm_Step3;
   RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/');
   start=1;
   stop=length(&COMPANY_NAME._Step23);
   call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      do while (pos > 0);
         &COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
         &COMPANY_NAME._noB = prxchange('s/<\w*>|\[\w*\]|\(\w*\)|\(\w*\)|"\w*"|''\w*''/ /', -1, &COMPANY_NAME._Step23);
         call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      end;
	  drop RegExID pos length start stop;
run;

proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;

%MEND CompnayNameM;

%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;

however, the value like,

3M INNOVATIVE PROPERTIES COMPANY (MINNESOTA MINING  &  MANUFACTURING INNOVATIVE PROPERTIES COMPANY)

or 

3RD ANGLE (U.K.)

or 

ABB (ASEA BROWN BOVERI)

cannot be processed. could you give me some suggestions about this?

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

You can apply the same logic as in your previous thread on each variable and merge like this

 

data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;

data one;
   format NAME1 NAME1_inB NAME1_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
   NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
      do while (pos > 0);
         NAME1_inB = substr(NAME1, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
      end;
   keep NAME1 NAME1_inB NAME1_noB;
run;

data two;
   format NAME2 NAME2_inB NAME2_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
   NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
      do while (pos > 0);
         NAME2_inB = substr(NAME2, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
      end;
   keep NAME2 NAME2_inB NAME2_noB;
run;

data want;
   merge one two;
run;

View solution in original post

1 REPLY 1
PeterClemmensen
Tourmaline | Level 20

You can apply the same logic as in your previous thread on each variable and merge like this

 

data have;
input NAME1:$100. NAME2:$100. NAME_SHORT:$100. Previous_name:$100.;
infile datalines dlm="," missover;
datalines;
JUICE<BR>apple[footer],  HARDY(FRNS.)'A',  HARDY,  HARDY
juice <BR> apple,  juice (BR) apple, juice (BR) apple, juice (BR) apple 
juice<BODY> 'apple', juice(BODY), juice(BODY), juice(BODY)
<figure> "juice" LTD, ,"juice" LTD, "juice" LTD
ABB (ASEA BROWN BOVERI)
;
run;

data one;
   format NAME1 NAME1_inB NAME1_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
   NAME1_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME1);
      do while (pos > 0);
         NAME1_inB = substr(NAME1, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME1), NAME1, pos, length);
      end;
   keep NAME1 NAME1_inB NAME1_noB;
run;

data two;
   format NAME2 NAME2_inB NAME2_noB;
   set have;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
   NAME2_noB=prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, NAME2);
      do while (pos > 0);
         NAME2_inB = substr(NAME2, pos+1, length-2);
         output;
         call prxnext(RegExID, start, length(NAME2), NAME2, pos, length);
      end;
   keep NAME2 NAME2_inB NAME2_noB;
run;

data want;
   merge one two;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 449 views
  • 1 like
  • 2 in conversation