Dear all,
How can I find all substrings enclosed in (), [], {}, <> or quotes (such as <BR>, [FONT], {BODY}, 'A', "JUICE") and extract them into a new variable?
In particular, for
'JUICE<BR>apple<footer>' I expect a blank between 'JUICE' and 'apple'.
I am using the following code:
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
juice{BODY} apple
[BR]juice apple
<figure> "juice" LTD
;
run;
data want1;
set have;
RegExID = prxparse('/<\w*>/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>//', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
I get
name | html | newname |
JUICE<BR>apple<footer> | <BR> | JUICEapple |
JUICE<BR>apple<footer> | <footer> | JUICEapple |
However, I expect a blank between 'JUICE' and 'apple', and html without the angle brackets:
name | html | newname |
JUICE<BR>apple<footer> | BR | JUICE apple |
JUICE<BR>apple<footer> | footer | JUICE apple |
Could you please give me some suggestions about this?
Thanks in advance.
I made one very small change to the PRXCHANGE function call in your program: the replacement text is now a single blank instead of nothing. See if this does the trick:
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
juice{BODY} apple
[BR]juice apple
<figure> "juice" LTD
;
run;
data want1;
set have;
RegExID = prxparse('/<\w*>/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>/ /', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
Thanks draycut,
but I get
name | html | newname |
JUICE<BR>apple[footer] | <BR> | JUICE apple[footer] |
juice <BR> apple | <BR> | juice apple |
juice<BODY> 'apple' | <BODY> | juice 'apple' |
<figure> "juice" LTD | <figure> | "juice" LTD |
Besides,
I cannot get the expected result with the following code:
data want1;
set have;
RegExID = prxparse('/<\w*>|[\w*]|{\w*}|'\w*'|"\w*"/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>|[\w*]|{\w*}|'\w*'|"\w*"/ /', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
for example,
name |
JUICE<BR>apple[footer] |
juice <BR> apple |
juice<BODY> 'apple' |
<figure> "juice" LTD |
I expect to get
name | html | newname |
JUICE<BR>apple[footer] | BR | JUICE apple |
JUICE<BR>apple[footer] | footer | JUICE apple |
juice <BR> apple | BR | juice apple |
juice<BODY> 'apple' | BODY | juice |
juice<BODY> 'apple' | apple | juice |
<figure> "juice" LTD | figure | LTD |
<figure> "juice" LTD | juice | LTD |
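Ah, I see what the problem is. The old Quotes Within Quotes problem 🙂
Your pattern is itself a single-quoted SAS string, so each single quote inside it has to be doubled (''), otherwise it ends the string early. The square brackets also need a backslash, because an unescaped [\w*] is read as a character class, not as a literal pair of brackets.
This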
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
<figure> "juice" LTD
;
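data want;
   format name html newname;   /* lists the variables only to fix their column order */
   set have;
   /* One alternative per delimiter pair; [ ] ( ) are escaped, and each
      literal single quote is doubled because the pattern is single-quoted */
   RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|"\w*"|''\w*''/');
   start = 1;
   call prxnext(RegExID, start, length(name), name, pos, length);
   /* Replace every delimited token with a blank */
   newname = prxchange('s/<\w*>|\[\w*\]|\(\w*\)|"\w*"|''\w*''/ /', -1, name);
   do while (pos > 0);
      /* Drop the opening and closing delimiter from the match */
      html = substr(name, pos+1, length-2);
      output;
      call prxnext(RegExID, start, length(name), name, pos, length);
   end;
   keep name html newname;
run;

proc print data=want;
run;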
gives you
Hello @PeterClemmensen
I have a new question that came up along the way.
The value
HARDY(FRNS.)'A'
is not processed correctly by the code.
I expect to get
name | COMPANY_NAME_inB | COMPANY_NAME_noB |
HARDY(FRNS.)'A' | FRNS. | HARDY |
HARDY(FRNS.)'A' | A | HARDY |
However, I only get
name | COMPANY_NAME_inB | COMPANY_NAME_noB |
HARDY(FRNS.)'A' | A | HARDY(FRNS.) |
Could you please give me some suggestions?
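One possible explanation, offered as an untested sketch rather than a confirmed answer: \w matches only letters, digits and the underscore, so \(\w*\) cannot match (FRNS.) because of the period inside the parentheses. A common workaround is to accept "anything except the closing delimiter", for example \([^)]*\). The data set names have2 and want2 and the variable name len are made up for this illustration, and the {} alternative is added only because the original question mentions braces.

```sas
data have2;
   infile datalines truncover;
   input name $100.;
   datalines;
HARDY(FRNS.)'A'
;

data want2;
   length name $100 COMPANY_NAME_inB $100 COMPANY_NAME_noB $200;
   set have2;
   /* Each alternative is: opening delimiter, any run of characters that
      are not the closing delimiter, closing delimiter. Single quotes are
      doubled because the pattern is a single-quoted SAS string. */
   RegExID = prxparse('/<[^>]*>|\[[^\]]*\]|\([^)]*\)|\{[^}]*\}|"[^"]*"|''[^'']*''/');
   /* Same pattern as a substitution: every delimited token becomes a blank */
   COMPANY_NAME_noB = prxchange('s/<[^>]*>|\[[^\]]*\]|\([^)]*\)|\{[^}]*\}|"[^"]*"|''[^'']*''/ /', -1, name);
   start = 1;
   call prxnext(RegExID, start, length(name), name, pos, len);
   do while (pos > 0);
      /* Strip the two delimiters; an empty pair such as () yields a blank */
      if len > 2 then COMPANY_NAME_inB = substr(name, pos + 1, len - 2);
      else COMPANY_NAME_inB = ' ';
      output;
      call prxnext(RegExID, start, length(name), name, pos, len);
   end;
   keep name COMPANY_NAME_inB COMPANY_NAME_noB;
run;

proc print data=want2;
run;
```

With this pattern the sample value should produce two rows, one for FRNS. and one for A, both with HARDY as the cleaned name. Note that the negated classes are more permissive than \w*: they also pick up tokens containing blanks or punctuation, which may or may not be what you want for your data.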