Dear all,
How can I find all strings enclosed in (), [], and {} (and similar delimiters, such as <BR>, [FONT], {BODY}, 'A', "JUICE") and split them out into a new variable?
In particular, for
'JUICE<BR>apple<footer>', I expect a blank to be inserted between 'JUICE' and 'apple'.
I am using the following code:
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
juice{BODY} apple
[BR]juice apple
<figure> "juice" LTD
;
run;
data want1;
set have;
RegExID = prxparse('/<\w*>/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>//', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
I get
name | html | newname |
JUICE<BR>apple<footer> | <BR> | JUICEapple |
JUICE<BR>apple<footer> | <footer> | JUICEapple |
However, I expect a blank between 'JUICE' and 'apple', and the html value without its angle brackets:
name | html | newname |
JUICE<BR>apple<footer> | BR | JUICE apple |
JUICE<BR>apple<footer> | footer | JUICE apple |
Could you please give me some suggestions about this?
Thanks in advance.
I just made a very small change to your program, in the PRXCHANGE function call: the replacement is now a blank instead of an empty string. See if this does the trick.
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
juice{BODY} apple
[BR]juice apple
<figure> "juice" LTD
;
run;
data want1;
set have;
RegExID = prxparse('/<\w*>/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>/ /', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
Thanks draycut,
but I get:
name | html | newname |
JUICE<BR>apple[footer] | <BR> | JUICE apple[footer] |
juice <BR> apple | <BR> | juice apple |
juice<BODY> 'apple' | <BODY> | juice 'apple' |
<figure> "juice" LTD | <figure> | "juice" LTD |
Besides, I cannot get the expected result with the following code:
data want1;
set have;
RegExID = prxparse('/<\w*>|[\w*]|{\w*}|'\w*'|"\w*"/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>|[\w*]|{\w*}|'\w*'|"\w*"/ /', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
For example, given
name |
JUICE<BR>apple[footer] |
juice <BR> apple |
juice<BODY> 'apple' |
<figure> "juice" LTD |
I expect to get
name | html | newname |
JUICE<BR>apple[footer] | BR | JUICE apple |
JUICE<BR>apple[footer] | footer | JUICE apple |
juice <BR> apple | BR | juice apple |
juice<BODY> 'apple' | BODY | juice |
juice<BODY> 'apple' | apple | juice |
<figure> "juice" LTD | figure | LTD |
<figure> "juice" LTD | juice | LTD |
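Ah, I see what the problem is: the old quotes-within-quotes problem 🙂
The pattern sits inside a single-quoted SAS string, so every single quote in the regex has to be doubled (''). The square brackets also need escaping (\[ and \]), otherwise they are read as a character class.
This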
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
<figure> "juice" LTD
;
data want;
format name html newname;
set have;
RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|"\w*"|''\w*''/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
newname=prxchange('s/<\w*>|\[\w*\]|\(\w*\)|"\w*"|''\w*''/ /', -1, name);
do while (pos > 0);
html = substr(name, pos+1, length-2);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
proc print data=want;
run;
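gives you one row per delimited string, with the delimiters stripped from html and each delimited string replaced by a blank in newname.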
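The quote doubling can be checked in isolation. This is only a minimal sketch; the variable names pattern, re, and found are made up for illustration and are not part of the program above.

```sas
data _null_;
   /* Inside a single-quoted SAS literal, two consecutive single quotes
      stand for one literal quote, so this stores the regex /'\w*'/ */
   pattern = '/''\w*''/';
   re = prxparse(pattern);
   /* PRXMATCH returns the position of the first match, or 0 if none */
   found = prxmatch(re, "juice<BODY> 'apple'");
   put pattern= found=;   /* the log should show pattern=/'\w*'/ */
run;
```

If found is greater than 0 in the log, the doubled quotes reached the regex engine as single quotes.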
Hello @PeterClemmensen,
I have a follow-up question.
The value
HARDY(FRNS.)'A'
is not processed correctly by the code.
I expect to get
name | COMPANY_NAME_inB | COMPANY_NAME_noB |
HARDY(FRNS.)'A' | FRNS. | HARDY |
HARDY(FRNS.)'A' | A | HARDY |
However, I only get
name | COMPANY_NAME_inB | COMPANY_NAME_noB |
HARDY(FRNS.)'A' | A | HARDY(FRNS.) |
Could you please give me some suggestions?
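One likely cause, offered as a sketch rather than a tested answer: \w matches only letters, digits, and underscores, so \(\w*\) cannot match (FRNS.) because of the period. A possible workaround is to accept any character up to the closing delimiter, using a negated character class for each delimiter pair. In the code below, the negated classes, the added \{...\} branch, and the variable name len (used instead of length) are my own assumptions, not part of the accepted answer.

```sas
data have;
   infile datalines truncover;
   input name $100.;
   datalines;
HARDY(FRNS.)'A'
;

data want;
   set have;
   length COMPANY_NAME_inB COMPANY_NAME_noB $100;
   /* Assumption: everything up to the closing delimiter is content.  */
   /* Single quotes are doubled because the pattern is single-quoted. */
   RegExID = prxparse('/<[^>]*>|\[[^\]]*\]|\([^)]*\)|\{[^}]*\}|"[^"]*"|''[^'']*''/');
   /* Replace every delimited string with a blank */
   COMPANY_NAME_noB = prxchange('s/<[^>]*>|\[[^\]]*\]|\([^)]*\)|\{[^}]*\}|"[^"]*"|''[^'']*''/ /', -1, name);
   start = 1;
   call prxnext(RegExID, start, length(name), name, pos, len);
   do while (pos > 0);
      /* Skip the opening delimiter and drop the closing one */
      COMPANY_NAME_inB = substr(name, pos + 1, len - 2);
      output;
      call prxnext(RegExID, start, length(name), name, pos, len);
   end;
   keep name COMPANY_NAME_inB COMPANY_NAME_noB;
run;
```

With this pattern, (FRNS.) and 'A' should each produce their own row. Note that an empty pair such as () would make the SUBSTR length zero, so it may need a guard if such values occur in your data.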