Solved: split the value when they are included in the ( ),[ ],{ },' '," ", sep...

Alexxxxxxx · Posted 04-03-2019 08:08 PM

Dear all,

how can I split the value when they are included in the ( ),[ ],{ },' '," ", or separated by blank, and then count the frequency of their appearance?

for example,

for table A

no	name
1	juice<BR>a@pple[footer]
2	juice <BR> apple
3	juice<BODY> 'apple'
4	juice{BODY} apple
5	[BR]juice appl'e
6	<figure> "juice" LTD

I expect to get table b

no	name	new_name	count the frequency
1	juice<BR>a@pple[footer]	juice	6
1	juice<BR>a@pple[footer]	BR	3
1	juice<BR>a@pple[footer]	a@pple	1
1	juice<BR>a@pple[footer]	footer	1
2	juice <BR> apple	juice	6
2	juice <BR> apple	BR	3
2	juice <BR> apple	apple	2
3	juice<BODY> 'apple'	juice	6
3	juice<BODY> 'apple'	BODY	2
3	juice<BODY> 'apple'	apple	2
4	juice{BODY} CO.LTD	juice	6
4	juice{BODY} CO.LTD	BODY	2
4	juice{BODY} CO.LTD	CO.LTD	1
5	[BR]juice appl'e	BR	3
5	[BR]juice appl'e	juice	6
5	[BR]juice appl'e	appl'e	1
6	<figure> "juice" LTD	figure	1
6	<figure> "juice" LTD	juice	6
6	<figure> "juice" LTD	LTD	1

'count the frequency' count the frequency of 'new_name’ appearance in the total 'new_name' variables.

Could you please give me some suggestions about this?

thanks in advance.

ChrisNZ · Posted 04-03-2019 10:43 PM

This way to split is more generic and easier to manage:

data HAVE;
input NO NAME &:$100.;
infile datalines  missover;
datalines;
1 juice<BR>a@pple[footer] 
2 juice <BR> apple 
3 juice<BODY> 'apple' 
4 juice{BODY} apple 
5 [BR]juice appl'e
6 <figure> "juice" LTD 
run;

data WANT;
  set HAVE;
  length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
  retain REGEX;
  array PAIRS [12] $1 _temporary_ ( '['  ']'  '{'  '}'  '<'  '>'  '"'  '"'  "'"  "'"  '('  ')' ) ;
  if _N_=1 then do;
    do I=1 to 12 by 2;
      MATCH_PAIR  = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
      MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
    end;
    REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
  end;
  NAME1=prxchange(REGEX, -1, NAME);
  do I=1 to countw(NAME1,' ');
    WORD=scan(NAME1, I);
    output;
  end;
  keep NO WORD;
run;

NO	WORD
1	juice
1	BR
1	a@pple
1	footer
2	juice
2	BR
2	apple
3	juice
3	BODY
3	apple
4	juice
4	BODY
4	apple
5	BR
5	juice
5	appl'e
6	figure
6	juice
6	LTD

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 04-03-2019 09:25 PM

You already have the split part from previous replies.

Now you must:

- count

- merge the counts and the splits

Please try and report.

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 04-03-2019 09:47 PM

Dear @ChrisNZ

thanks. yes, I can split the variable based on brackets etc. by the following code,

Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;

%MACRO CompnayNameM(no=,Company_name=);

data step9.Patstat_gb_hrm_Step3;
   set step9.Patstat_gb_hrm_Step3;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   stop=length(&COMPANY_NAME._Step23);
   call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      do while (pos > 0);
         &COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
         &COMPANY_NAME._noB = prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, &COMPANY_NAME._Step23);
         call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      end;
	  drop RegExID pos length start stop;
run;

proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;

%MEND CompnayNameM;

%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;

However, I do not know how to separate them based on the blank. Could you please give me some suggestion about this part?

ChrisNZ · Posted 04-03-2019 10:43 PM

This way to split is more generic and easier to manage:

data HAVE;
input NO NAME &:$100.;
infile datalines  missover;
datalines;
1 juice<BR>a@pple[footer] 
2 juice <BR> apple 
3 juice<BODY> 'apple' 
4 juice{BODY} apple 
5 [BR]juice appl'e
6 <figure> "juice" LTD 
run;

data WANT;
  set HAVE;
  length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
  retain REGEX;
  array PAIRS [12] $1 _temporary_ ( '['  ']'  '{'  '}'  '<'  '>'  '"'  '"'  "'"  "'"  '('  ')' ) ;
  if _N_=1 then do;
    do I=1 to 12 by 2;
      MATCH_PAIR  = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
      MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
    end;
    REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
  end;
  NAME1=prxchange(REGEX, -1, NAME);
  do I=1 to countw(NAME1,' ');
    WORD=scan(NAME1, I);
    output;
  end;
  keep NO WORD;
run;

NO	WORD
1	juice
1	BR
1	a@pple
1	footer
2	juice
2	BR
2	apple
3	juice
3	BODY
3	apple
4	juice
4	BODY
4	apple
5	BR
5	juice
5	appl'e
6	figure
6	juice
6	LTD

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 04-04-2019 10:23 PM

Dear @ChrisNZ

I really appreciate your help, which works very well. however, I face a problem during the process.

I find some value which includes '&' ( for example, 'SWAN JOHN & SONS', 'SMITH & NEPHEW NEW') are not been processed by the code. and I expect to treat it as a blank.

for example,

no	name
1	SWAN JOHN & SONS
1	SMITH & NEPHEW NEW

no	name	new_name
1	SWAN JOHN & SONS	SWAN
1	SWAN JOHN & SONS	JOHN
1	SWAN JOHN & SONS	SONS
2	SMITH & NEPHEW NEW	SMITH
2	SMITH & NEPHEW NEW	NEPHEW
2	SMITH & NEPHEW NEW	NEW

Could you please give me some suggestion about this?

thanks in advance

ChrisNZ · Posted 04-05-2019 12:28 AM

Odd, it seems to works for me.

Otherwise, just replace string ' & ' with ' ' ?

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 04-05-2019 07:15 PM

Dear @ChrisNZ

thanks for your help.

may I ask one more question, please?

how can I split the variable include '.'?

for example,

for 'M & L PROPERTY & ASS.PLC.'

I get

M & L PROPERTY & ASS.PLC.	M
M & L PROPERTY & ASS.PLC.	L
M & L PROPERTY & ASS.PLC.	PROPERTY
M & L PROPERTY & ASS.PLC.	ASS
M & L PROPERTY & ASS.PLC.	PLC

however, I expect to get

M & L PROPERTY & ASS.PLC.	M
M & L PROPERTY & ASS.PLC.	L
M & L PROPERTY & ASS.PLC.	PROPERTY
M & L PROPERTY & ASS.PLC.	ASS.PLC.

I

ChrisNZ · Posted 04-07-2019 05:11 PM

Add a space as the third parameter of the scan function.

High-Performance SAS Coding - Third Edition

Tom · Posted 04-03-2019 11:10 PM

How important is it to treat this strange value as a single word and not two?

appl'e

If you don't really need to match the various brackets you can just treat them all as delimiters.

data want ;
 set have ;
 word=name;
 do word_no=1 by 1 until(word=' ');
   word =scan(name,word_no,' []{}<>"''');
   if word_no=1 or word ne ' ' then output;
 end;
run;

The FREQ Procedure

                                   Cumulative    Cumulative
word      Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
BODY             2       10.00             2        10.00
BR               3       15.00             5        25.00
LTD              1        5.00             6        30.00
a@pple           1        5.00             7        35.00
appl             1        5.00             8        40.00
apple            3       15.00            11        55.00
e                1        5.00            12        60.00
figure           1        5.00            13        65.00
footer           1        5.00            14        70.00
juice            6       30.00            20       100.00

split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank and count freq

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Re: split the value when they are included in the ( ),[ ],{ },' '," ", separated by blank

Registration is open

SAS Training: Just a Click Away