Dear all,
How can I find all strings enclosed in delimiters such as <>, (), [], {} or quotes (for example <BR>, [FONT], {BODY}, 'A', "JUICE") and extract them into a new variable?
In particular, for
'JUICE<BR>apple<footer>', I want a blank inserted between 'JUICE' and 'apple'.
Using the following code,
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
juice{BODY} apple
[BR]juice apple
<figure> "juice" LTD
;
run;
data want1;
set have;
RegExID = prxparse('/<\w*>/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>//', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
I get
name | html | newname |
JUICE<BR>apple<footer> | <BR> | JUICEapple |
JUICE<BR>apple<footer> | <footer> | JUICEapple |
However, I expect a blank between 'JUICE' and 'apple', and the tag without its delimiters:
name | html | newname |
JUICE<BR>apple<footer> | BR | JUICE apple |
JUICE<BR>apple<footer> | footer | JUICE apple |
Could you please give me some suggestions about this?
Thanks in advance.
I just made a very small change to your program, in the PRXCHANGE function. See if this does the trick:
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
juice{BODY} apple
[BR]juice apple
<figure> "juice" LTD
;
run;
data want1;
set have;
RegExID = prxparse('/<\w*>/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>/ /', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
Thanks draycut,
but now I get
name | html | newname |
JUICE<BR>apple[footer] | <BR> | JUICE apple[footer] |
juice <BR> apple | <BR> | juice apple |
juice<BODY> 'apple' | <BODY> | juice 'apple' |
<figure> "juice" LTD | <figure> | "juice" LTD |
Besides,
I cannot get the expected result with the following code,
data want1;
set have;
RegExID = prxparse('/<\w*>|[\w*]|{\w*}|'\w*'|"\w*"/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
do while (pos > 0);
html = substr(name, pos, length);
newname=prxchange('s/<\w*>|[\w*]|{\w*}|'\w*'|"\w*"/ /', -1, name);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
For example, for the input
name |
JUICE<BR>apple[footer] |
juice <BR> apple |
juice<BODY> 'apple' |
<figure> "juice" LTD |
I expect to get
name | html | newname |
JUICE<BR>apple[footer] | BR | JUICE apple |
JUICE<BR>apple[footer] | footer | JUICE apple |
juice <BR> apple | BR | juice apple |
juice<BODY> 'apple' | BODY | juice |
juice<BODY> 'apple' | apple | juice |
<figure> "juice" LTD | figure | LTD |
<figure> "juice" LTD | juice | LTD |
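Ah, I see what the problem is: the old quotes-within-quotes problem 🙂 The pattern sits inside a single-quoted SAS string, so each single quote within it has to be doubled (''), and literal brackets have to be escaped with a backslash so they are not read as a character class.
This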
data have ;
infile datalines truncover;
input name $100.;
datalines;
JUICE<BR>apple[footer]
juice <BR> apple
juice<BODY> 'apple'
<figure> "juice" LTD
;
data want;
format name html newname;
set have;
RegExID = prxparse('/<\w*>|\[\w*\]|\(\w*\)|"\w*"|''\w*''/');
start=1;
call prxnext(RegExID, start, length(name), name, pos, length);
newname=prxchange('s/<\w*>|\[\w*\]|\(\w*\)|"\w*"|''\w*''/ /', -1, name);
do while (pos > 0);
html = substr(name, pos+1, length-2);
output;
call prxnext(RegExID, start, length(name), name, pos, length);
end;
keep name html newname;
run;
proc print data=want;
run;
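gives you one row per delimited token, with the delimiters stripped in html and each token replaced by a blank in newname.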
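To make the quoting and escaping points concrete, here is a minimal sketch. The variable names (p1, p2, same, class_hit, literal_hit, literal_no) are made up for illustration and do not appear in the thread. It compares the same pattern written in a single-quoted and a double-quoted literal, and shows how an unescaped [\w*] behaves differently from an escaped \[\w*\].

```sas
data _null_;
   /* Inside a single-quoted literal, each embedded single quote is doubled */
   p1 = '/''\w*''/';
   /* Inside a double-quoted literal, single quotes need no doubling */
   p2 = "/'\w*'/";
   same = (p1 = p2);                                /* 1: the two literals are identical */

   /* Unescaped [\w*] is a character class: one word character or an asterisk */
   class_hit   = prxmatch('/[\w*]/',   'abc');      /* 1: matches the first letter */
   /* Escaped \[\w*\] is a literal bracket pair around word characters */
   literal_hit = prxmatch('/\[\w*\]/', 'x[abc]');   /* 2: match starts at the bracket */
   literal_no  = prxmatch('/\[\w*\]/', 'abc');      /* 0: no brackets in the source */

   put p1= p2= same= class_hit= literal_hit= literal_no=;
run;
```

A double-quoted literal avoids the doubling, but macro triggers (& and %) are resolved inside double quotes, which is one reason to stay with single quotes for patterns.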
Hello @PeterClemmensen,
I have a follow-up question. The value
HARDY(FRNS.)'A'
is not processed correctly by the code.
I expect to get
name | COMPANY_NAME_inB | COMPANY_NAME_noB |
HARDY(FRNS.)'A' | FRNS. | HARDY |
HARDY(FRNS.)'A' | A | HARDY |
However, I only get
name | COMPANY_NAME_inB | COMPANY_NAME_noB |
HARDY(FRNS.)'A' | A | HARDY(FRNS.) |
Could you please give me some suggestions?
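One likely cause, offered as a sketch rather than a confirmed answer from this thread: \w matches only letters, digits and underscores, so \(\w*\) cannot match (FRNS.) because of the period. A possible workaround is to match "anything up to the closing delimiter" with negated character classes. The data set names have2 and want2 and the variable len are hypothetical; the pattern uses + instead of *, so empty pairs such as () are deliberately skipped, and COMPBL and STRIP tidy the blanks left behind.

```sas
/* Hypothetical data set names; only the sample value comes from the thread */
data have2;
   infile datalines truncover;
   input name $100.;
   datalines;
HARDY(FRNS.)'A'
;

data want2;
   set have2;
   length COMPANY_NAME_inB COMPANY_NAME_noB $100;

   /* [^)]+ and friends accept any character up to the closing delimiter, */
   /* so a period inside the brackets no longer blocks the match.         */
   RegExID = prxparse('/<[^>]+>|\[[^\]]+\]|\([^)]+\)|\{[^}]+\}|"[^"]+"|''[^'']+''/');

   /* Replace every delimited token with a blank, then tidy the spacing */
   COMPANY_NAME_noB = strip(compbl(prxchange(
      's/<[^>]+>|\[[^\]]+\]|\([^)]+\)|\{[^}]+\}|"[^"]+"|''[^'']+''/ /', -1, name)));

   start = 1;
   call prxnext(RegExID, start, length(name), name, pos, len);
   do while (pos > 0);
      /* Drop the opening and closing delimiter from the matched text */
      COMPANY_NAME_inB = substr(name, pos + 1, len - 2);
      output;
      call prxnext(RegExID, start, length(name), name, pos, len);
   end;

   keep name COMPANY_NAME_inB COMPANY_NAME_noB;
run;
```

For the sample row this should give two observations, with FRNS. and A in COMPANY_NAME_inB and HARDY in COMPANY_NAME_noB.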