BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

how can I split the value when they are included in the ( ),[ ],{ },' '," ", or separated by blank, and then count the frequency of their appearance? 

 

for example,

 

for table A

noname 
1juice<BR>a@pple[footer] 
2juice <BR> apple 
3juice<BODY> 'apple' 
4juice{BODY} apple 
5[BR]juice appl'e
6

<figure> "juice" LTD 

 

I expect to get table b

noname new_namecount the frequency
1juice<BR>a@pple[footer] juice6
1juice<BR>a@pple[footer] BR3
1juice<BR>a@pple[footer] a@pple1
1juice<BR>a@pple[footer] footer1
2juice <BR> apple juice 6
2juice <BR> apple BR3
2juice <BR> apple apple 2
3juice<BODY> 'apple' juice 6
3juice<BODY> 'apple' BODY2
3juice<BODY> 'apple' apple2
4juice{BODY} CO.LTDjuice6
4juice{BODY} CO.LTDBODY2
4juice{BODY} CO.LTDCO.LTD1
5[BR]juice appl'eBR3
5[BR]juice appl'ejuice6
5[BR]juice appl'eappl'e1
6<figure> "juice" LTD figure1
6<figure> "juice" LTD juice6
6<figure> "juice" LTD LTD 1

 

'count the frequency' count the frequency of 'new_name’ appearance in the total 'new_name' variables.

Could you please give me some suggestions about this?

thanks in advance.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

This way to split is more generic and easier to manage:

data HAVE;
input NO NAME &:$100.;
infile datalines  missover;
datalines;
1 juice<BR>a@pple[footer] 
2 juice <BR> apple 
3 juice<BODY> 'apple' 
4 juice{BODY} apple 
5 [BR]juice appl'e
6 <figure> "juice" LTD 
run;

data WANT;
  set HAVE;
  length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
  retain REGEX;
  array PAIRS [12] $1 _temporary_ ( '['  ']'  '{'  '}'  '<'  '>'  '"'  '"'  "'"  "'"  '('  ')' ) ;
  if _N_=1 then do;
    do I=1 to 12 by 2;
      MATCH_PAIR  = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
      MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
    end;
    REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
  end;
  NAME1=prxchange(REGEX, -1, NAME);
  do I=1 to countw(NAME1,' ');
    WORD=scan(NAME1, I);
    output;
  end;
  keep NO WORD;
run;

NO WORD
1 juice
1 BR
1 a@pple
1 footer
2 juice
2 BR
2 apple
3 juice
3 BODY
3 apple
4 juice
4 BODY
4 apple
5 BR
5 juice
5 appl'e
6 figure
6 juice
6 LTD

 

 

View solution in original post

8 REPLIES 8
ChrisNZ
Tourmaline | Level 20

You already have the split part from previous replies.

Now you must:

- count

- merge the counts and the splits

Please try and report.

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

thanks. yes, I can split the variable based on brackets etc. by the following code,

 

Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;

%MACRO CompnayNameM(no=,Company_name=);

data step9.Patstat_gb_hrm_Step3;
   set step9.Patstat_gb_hrm_Step3;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   stop=length(&COMPANY_NAME._Step23);
   call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      do while (pos > 0);
         &COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
         &COMPANY_NAME._noB = prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, &COMPANY_NAME._Step23);
         call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      end;
	  drop RegExID pos length start stop;
run;

proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;

%MEND CompnayNameM;

%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;

However, I do not know how to separate them based on the blank. Could you please give me some suggestion about this part?

ChrisNZ
Tourmaline | Level 20

This way to split is more generic and easier to manage:

data HAVE;
input NO NAME &:$100.;
infile datalines  missover;
datalines;
1 juice<BR>a@pple[footer] 
2 juice <BR> apple 
3 juice<BODY> 'apple' 
4 juice{BODY} apple 
5 [BR]juice appl'e
6 <figure> "juice" LTD 
run;

data WANT;
  set HAVE;
  length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
  retain REGEX;
  array PAIRS [12] $1 _temporary_ ( '['  ']'  '{'  '}'  '<'  '>'  '"'  '"'  "'"  "'"  '('  ')' ) ;
  if _N_=1 then do;
    do I=1 to 12 by 2;
      MATCH_PAIR  = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
      MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
    end;
    REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
  end;
  NAME1=prxchange(REGEX, -1, NAME);
  do I=1 to countw(NAME1,' ');
    WORD=scan(NAME1, I);
    output;
  end;
  keep NO WORD;
run;

NO WORD
1 juice
1 BR
1 a@pple
1 footer
2 juice
2 BR
2 apple
3 juice
3 BODY
3 apple
4 juice
4 BODY
4 apple
5 BR
5 juice
5 appl'e
6 figure
6 juice
6 LTD

 

 

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

I really appreciate your help, which works very well. however, I face a problem during the process.

 

I find some value which includes '&' ( for example, 'SWAN JOHN & SONS',  'SMITH & NEPHEW NEW') are not been processed by the code. and I expect to treat it as a blank.

 

for example,

noname 
1SWAN JOHN & SONS
1SMITH & NEPHEW NEW

 

noname new_name
1SWAN JOHN & SONSSWAN
1SWAN JOHN & SONSJOHN
1SWAN JOHN & SONSSONS
2SMITH & NEPHEW NEWSMITH 
2SMITH & NEPHEW NEWNEPHEW
2SMITH & NEPHEW NEWNEW

 

 

Could you please give me some suggestion about this?

thanks in advance

 

ChrisNZ
Tourmaline | Level 20

Odd, it seems to works for me.

Otherwise, just replace string ' & ' with ' ' ?

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

thanks for your help.

 

may I ask one more question, please?

 

how can I split the variable include '.'?

 

for example,

for 'M & L PROPERTY & ASS.PLC.'

I get 

M & L PROPERTY & ASS.PLC.M
M & L PROPERTY & ASS.PLC.L
M & L PROPERTY & ASS.PLC.PROPERTY
M & L PROPERTY & ASS.PLC.ASS
M & L PROPERTY & ASS.PLC.PLC

 

however, I expect to get

M & L PROPERTY & ASS.PLC.M
M & L PROPERTY & ASS.PLC.L
M & L PROPERTY & ASS.PLC.PROPERTY
M & L PROPERTY & ASS.PLC.ASS.PLC.

 

 

Tom
Super User Tom
Super User

How important is it to treat this strange value as a single word and not two?

appl'e

If you don't really need to match the various brackets you can just treat them all as delimiters.

data want ;
 set have ;
 word=name;
 do word_no=1 by 1 until(word=' ');
   word =scan(name,word_no,' []{}<>"''');
   if word_no=1 or word ne ' ' then output;
 end;
run;
The FREQ Procedure

                                   Cumulative    Cumulative
word      Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
BODY             2       10.00             2        10.00
BR               3       15.00             5        25.00
LTD              1        5.00             6        30.00
a@pple           1        5.00             7        35.00
appl             1        5.00             8        40.00
apple            3       15.00            11        55.00
e                1        5.00            12        60.00
figure           1        5.00            13        65.00
footer           1        5.00            14        70.00
juice            6       30.00            20       100.00

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1700 views
  • 2 likes
  • 3 in conversation