Dear all,
how can I split the value when they are included in the ( ),[ ],{ },' '," ", or separated by blank, and then count the frequency of their appearance?
for example,
for table A
no | name |
1 | juice<BR>a@pple[footer] |
2 | juice <BR> apple |
3 | juice<BODY> 'apple' |
4 | juice{BODY} apple |
5 | [BR]juice appl'e |
6 | <figure> "juice" LTD |
I expect to get table b
no | name | new_name | count the frequency |
1 | juice<BR>a@pple[footer] | juice | 6 |
1 | juice<BR>a@pple[footer] | BR | 3 |
1 | juice<BR>a@pple[footer] | a@pple | 1 |
1 | juice<BR>a@pple[footer] | footer | 1 |
2 | juice <BR> apple | juice | 6 |
2 | juice <BR> apple | BR | 3 |
2 | juice <BR> apple | apple | 2 |
3 | juice<BODY> 'apple' | juice | 6 |
3 | juice<BODY> 'apple' | BODY | 2 |
3 | juice<BODY> 'apple' | apple | 2 |
4 | juice{BODY} CO.LTD | juice | 6 |
4 | juice{BODY} CO.LTD | BODY | 2 |
4 | juice{BODY} CO.LTD | CO.LTD | 1 |
5 | [BR]juice appl'e | BR | 3 |
5 | [BR]juice appl'e | juice | 6 |
5 | [BR]juice appl'e | appl'e | 1 |
6 | <figure> "juice" LTD | figure | 1 |
6 | <figure> "juice" LTD | juice | 6 |
6 | <figure> "juice" LTD | LTD | 1 |
'count the frequency' count the frequency of 'new_name’ appearance in the total 'new_name' variables.
Could you please give me some suggestions about this?
thanks in advance.
This way to split is more generic and easier to manage:
data HAVE;
input NO NAME &:$100.;
infile datalines missover;
datalines;
1 juice<BR>a@pple[footer]
2 juice <BR> apple
3 juice<BODY> 'apple'
4 juice{BODY} apple
5 [BR]juice appl'e
6 <figure> "juice" LTD
run;
data WANT;
set HAVE;
length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
retain REGEX;
array PAIRS [12] $1 _temporary_ ( '[' ']' '{' '}' '<' '>' '"' '"' "'" "'" '(' ')' ) ;
if _N_=1 then do;
do I=1 to 12 by 2;
MATCH_PAIR = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
end;
REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
end;
NAME1=prxchange(REGEX, -1, NAME);
do I=1 to countw(NAME1,' ');
WORD=scan(NAME1, I);
output;
end;
keep NO WORD;
run;
NO | WORD |
---|---|
1 | juice |
1 | BR |
1 | a@pple |
1 | footer |
2 | juice |
2 | BR |
2 | apple |
3 | juice |
3 | BODY |
3 | apple |
4 | juice |
4 | BODY |
4 | apple |
5 | BR |
5 | juice |
5 | appl'e |
6 | figure |
6 | juice |
6 | LTD |
You already have the split part from previous replies.
Now you must:
- count
- merge the counts and the splits
Please try and report.
Dear @ChrisNZ
thanks. yes, I can split the variable based on brackets etc. by the following code,
Data step9.Patstat_gb_hrm_Step3;
Set step9.Patstat_gb_hrm_Step23;
Run;
%MACRO CompnayNameM(no=,Company_name=);
data step9.Patstat_gb_hrm_Step3;
set step9.Patstat_gb_hrm_Step3;
RegExID = prxparse('/<.*>|\[.*\]|\(.*\)|".*"|''.*''/');
start=1;
stop=length(&COMPANY_NAME._Step23);
call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
do while (pos > 0);
&COMPANY_NAME._inB = substr(&COMPANY_NAME._Step23, pos+1, length-2);
&COMPANY_NAME._noB = prxchange('s/<.*>|\[.*\]|\(.*\)|".*"|''.*''/ /', -1, &COMPANY_NAME._Step23);
call prxnext(RegExID, start, stop, &COMPANY_NAME._Step23, pos, length);
end;
drop RegExID pos length start stop;
run;
proc sql;
create table PATSTAT&no. as
select distinct
&COMPANY_NAME.,
&COMPANY_NAME._Step23,
&COMPANY_NAME._inB,
&COMPANY_NAME._noB
from step9.Patstat_gb_hrm_Step3
where &COMPANY_NAME._inB ne ''
;
quit;
%MEND CompnayNameM;
%CompnayNameM(no=1,Company_name=HRM_L2)
%CompnayNameM(no=2,Company_name=PERSON_NAME)
run;
However, I do not know how to separate them based on the blank. Could you please give me some suggestion about this part?
This way to split is more generic and easier to manage:
data HAVE;
input NO NAME &:$100.;
infile datalines missover;
datalines;
1 juice<BR>a@pple[footer]
2 juice <BR> apple
3 juice<BODY> 'apple'
4 juice{BODY} apple
5 [BR]juice appl'e
6 <figure> "juice" LTD
run;
data WANT;
set HAVE;
length PAIR MATCH_PAIR MATCH_PAIRS $200 WORD $20;
retain REGEX;
array PAIRS [12] $1 _temporary_ ( '[' ']' '{' '}' '<' '>' '"' '"' "'" "'" '(' ')' ) ;
if _N_=1 then do;
do I=1 to 12 by 2;
MATCH_PAIR = catt('\', PAIRS[I], '(.*)\', PAIRS[I+1]);
MATCH_PAIRS = catx('|', MATCH_PAIRS, MATCH_PAIR);
end;
REGEX = prxparse(catt('s/', MATCH_PAIRS, '/ $1$2$3$4$5$6 /'));
end;
NAME1=prxchange(REGEX, -1, NAME);
do I=1 to countw(NAME1,' ');
WORD=scan(NAME1, I);
output;
end;
keep NO WORD;
run;
NO | WORD |
---|---|
1 | juice |
1 | BR |
1 | a@pple |
1 | footer |
2 | juice |
2 | BR |
2 | apple |
3 | juice |
3 | BODY |
3 | apple |
4 | juice |
4 | BODY |
4 | apple |
5 | BR |
5 | juice |
5 | appl'e |
6 | figure |
6 | juice |
6 | LTD |
Dear @ChrisNZ
I really appreciate your help, which works very well. however, I face a problem during the process.
I find some value which includes '&' ( for example, 'SWAN JOHN & SONS', 'SMITH & NEPHEW NEW') are not been processed by the code. and I expect to treat it as a blank.
for example,
no | name |
1 | SWAN JOHN & SONS |
1 | SMITH & NEPHEW NEW |
no | name | new_name |
1 | SWAN JOHN & SONS | SWAN |
1 | SWAN JOHN & SONS | JOHN |
1 | SWAN JOHN & SONS | SONS |
2 | SMITH & NEPHEW NEW | SMITH |
2 | SMITH & NEPHEW NEW | NEPHEW |
2 | SMITH & NEPHEW NEW | NEW |
Could you please give me some suggestion about this?
thanks in advance
Odd, it seems to works for me.
Otherwise, just replace string ' & ' with ' ' ?
Dear @ChrisNZ
thanks for your help.
may I ask one more question, please?
how can I split the variable include '.'?
for example,
for 'M & L PROPERTY & ASS.PLC.'
I get
M & L PROPERTY & ASS.PLC. | M |
M & L PROPERTY & ASS.PLC. | L |
M & L PROPERTY & ASS.PLC. | PROPERTY |
M & L PROPERTY & ASS.PLC. | ASS |
M & L PROPERTY & ASS.PLC. | PLC |
however, I expect to get
M & L PROPERTY & ASS.PLC. | M |
M & L PROPERTY & ASS.PLC. | L |
M & L PROPERTY & ASS.PLC. | PROPERTY |
M & L PROPERTY & ASS.PLC. | ASS.PLC. |
I
Add a space as the third parameter of the scan function.
How important is it to treat this strange value as a single word and not two?
appl'e
If you don't really need to match the various brackets you can just treat them all as delimiters.
data want ;
set have ;
word=name;
do word_no=1 by 1 until(word=' ');
word =scan(name,word_no,' []{}<>"''');
if word_no=1 or word ne ' ' then output;
end;
run;
The FREQ Procedure Cumulative Cumulative word Frequency Percent Frequency Percent ----------------------------------------------------------- BODY 2 10.00 2 10.00 BR 3 15.00 5 25.00 LTD 1 5.00 6 30.00 a@pple 1 5.00 7 35.00 appl 1 5.00 8 40.00 apple 3 15.00 11 55.00 e 1 5.00 12 60.00 figure 1 5.00 13 65.00 footer 1 5.00 14 70.00 juice 6 30.00 20 100.00
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.