BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

how can I split the value when they are included in the ( ),[ ],{ },' '," ", or separated by blank, and then count the frequency of their appearance? 

 

for example,

 

for table A

noname 
1juice<BR>a@pple[footer] 
2juice <BR> apple 
3juice<BODY> 'apple' 
4juice{BODY} apple 
5[BR]juice appl'e
6

<figure> "juice" LTD 

 

I expect to get table b

noname new_namecount the frequency
1juice<BR>a@pple[footer] juice6
1juice<BR>a@pple[footer] BR3
1juice<BR>a@pple[footer] a@pple1
1juice<BR>a@pple[footer] footer1
2juice <BR> apple juice 6
2juice <BR> apple BR3
2juice <BR> apple apple 2
3juice<BODY> 'apple' juice 6
3juice<BODY> 'apple' BODY2
3juice<BODY> 'apple' apple2
4juice{BODY} CO.LTDjuice6
4juice{BODY} CO.LTDBODY2
4juice{BODY} CO.LTDCO.LTD1
5[BR]juice appl'eBR3
5[BR]juice appl'ejuice6
5[BR]juice appl'eappl'e1
6<figure> "juice" LTD figure1
6<figure> "juice" LTD juice6
6<figure> "juice" LTD LTD 1

 

'count the frequency' count the frequency of 'new_name’ appearance in the total 'new_name' variables.

Could you please give me some suggestions about this?

thanks in advance.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

This way to split is more generic and easier to manage:

data HAVE;
input NO NAME &:$100.;
infile datalines  missover;
datalines;
1 juice<BR>a@pple[footer] 
2 juice <BR> apple 
3 juice<BODY> 'apple' 
4 juice{BODY} apple 
5 [BR]juice appl'e
6 <figure> "juice" LTD 
run;

data WANT;
  set HAVE;
  length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
  retain REGEX;
  array PAIRS [12] $1 _temporary_ ( '['  ']'  '{'  '}'  '<'  '>'  '"'  '"'  "'"  "'"  '('  ')' ) ;
  if _N_=1 then do;
    do I=1 to 12 by 2;
      MATCH_PAIR  = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
      MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
    end;
    REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
  end;
  NAME1=prxchange(REGEX, -1, NAME);
  do I=1 to countw(NAME1,' ');
    WORD=scan(NAME1, I);
    output;
  end;
  keep NO WORD;
run;

NO WORD
1 juice
1 BR
1 a@pple
1 footer
2 juice
2 BR
2 apple
3 juice
3 BODY
3 apple
4 juice
4 BODY
4 apple
5 BR
5 juice
5 appl'e
6 figure
6 juice
6 LTD

 

 

View solution in original post

8 REPLIES 8
ChrisNZ
Tourmaline | Level 20

You already have the split part from previous replies.

Now you must:

- count

- merge the counts and the splits

Please try and report.

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

thanks. yes, I can split the variable based on brackets etc. by the following code,

 

Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;

%MACRO CompnayNameM(no=,Company_name=);

data step9.Patstat_gb_hrm_Step3;
   set step9.Patstat_gb_hrm_Step3;
   RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
   start=1;
   stop=length(&COMPANY_NAME._Step23);
   call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      do while (pos > 0);
         &COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
         &COMPANY_NAME._noB = prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, &COMPANY_NAME._Step23);
         call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
      end;
	  drop RegExID pos length start stop;
run;

proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;

%MEND CompnayNameM;

%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;

However, I do not know how to separate them based on the blank. Could you please give me some suggestion about this part?

ChrisNZ
Tourmaline | Level 20

This way to split is more generic and easier to manage:

data HAVE;
input NO NAME &:$100.;
infile datalines  missover;
datalines;
1 juice<BR>a@pple[footer] 
2 juice <BR> apple 
3 juice<BODY> 'apple' 
4 juice{BODY} apple 
5 [BR]juice appl'e
6 <figure> "juice" LTD 
run;

data WANT;
  set HAVE;
  length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
  retain REGEX;
  array PAIRS [12] $1 _temporary_ ( '['  ']'  '{'  '}'  '<'  '>'  '"'  '"'  "'"  "'"  '('  ')' ) ;
  if _N_=1 then do;
    do I=1 to 12 by 2;
      MATCH_PAIR  = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
      MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
    end;
    REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
  end;
  NAME1=prxchange(REGEX, -1, NAME);
  do I=1 to countw(NAME1,' ');
    WORD=scan(NAME1, I);
    output;
  end;
  keep NO WORD;
run;

NO WORD
1 juice
1 BR
1 a@pple
1 footer
2 juice
2 BR
2 apple
3 juice
3 BODY
3 apple
4 juice
4 BODY
4 apple
5 BR
5 juice
5 appl'e
6 figure
6 juice
6 LTD

 

 

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

 

I really appreciate your help, which works very well. however, I face a problem during the process.

 

I find some value which includes '&' ( for example, 'SWAN JOHN & SONS',  'SMITH & NEPHEW NEW') are not been processed by the code. and I expect to treat it as a blank.

 

for example,

noname 
1SWAN JOHN & SONS
1SMITH & NEPHEW NEW

 

noname new_name
1SWAN JOHN & SONSSWAN
1SWAN JOHN & SONSJOHN
1SWAN JOHN & SONSSONS
2SMITH & NEPHEW NEWSMITH 
2SMITH & NEPHEW NEWNEPHEW
2SMITH & NEPHEW NEWNEW

 

 

Could you please give me some suggestion about this?

thanks in advance

 

ChrisNZ
Tourmaline | Level 20

Odd, it seems to works for me.

Otherwise, just replace string ' & ' with ' ' ?

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ 

thanks for your help.

 

may I ask one more question, please?

 

how can I split the variable include '.'?

 

for example,

for 'M & L PROPERTY & ASS.PLC.'

I get 

M & L PROPERTY & ASS.PLC.M
M & L PROPERTY & ASS.PLC.L
M & L PROPERTY & ASS.PLC.PROPERTY
M & L PROPERTY & ASS.PLC.ASS
M & L PROPERTY & ASS.PLC.PLC

 

however, I expect to get

M & L PROPERTY & ASS.PLC.M
M & L PROPERTY & ASS.PLC.L
M & L PROPERTY & ASS.PLC.PROPERTY
M & L PROPERTY & ASS.PLC.ASS.PLC.

 

 

Tom
Super User Tom
Super User

How important is it to treat this strange value as a single word and not two?

appl'e

If you don't really need to match the various brackets you can just treat them all as delimiters.

data want ;
 set have ;
 word=name;
 do word_no=1 by 1 until(word=' ');
   word =scan(name,word_no,' []{}<>"''');
   if word_no=1 or word ne ' ' then output;
 end;
run;
The FREQ Procedure

                                   Cumulative    Cumulative
word      Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
BODY             2       10.00             2        10.00
BR               3       15.00             5        25.00
LTD              1        5.00             6        30.00
a@pple           1        5.00             7        35.00
appl             1        5.00             8        40.00
apple            3       15.00            11        55.00
e                1        5.00            12        60.00
figure           1        5.00            13        65.00
footer           1        5.00            14        70.00
juice            6       30.00            20       100.00

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 659 views
  • 2 likes
  • 3 in conversation