Dear all,
I have the following problem. I have a string variable that contains, among other text, substrings enclosed in angle brackets (HTML tags). E.g.:
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
Now I want to keep only the text that is not enclosed in brackets. Here this would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
How can this be done?
Thanks,
Valentin
You could use a regular expression. Note that there is one other text string that is not enclosed in <>, so it is left in the result.
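filename FT15F001 temp lrecl=512;
data _null_;
   infile FT15F001;
   /* compile the pattern once: "<", any run of non-">" characters, then ">" */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* replace up to 100 matches in the current input line */
   c = prxchange(rx,100,_infile_);
   putlog c=;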
parmcards4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;
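Since the question is about a string variable rather than an external file, the same substitution can be applied directly to a data set variable. This is only a minimal sketch: the data set name have, the variable names text and clean, and the shortened sample value are made up for illustration.

```sas
/* Hypothetical input: one observation with a shortened HTML fragment in TEXT */
data have;
   length text $200;
   text = '<P><B><FONT SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT SIZE="2">THE COMPANY USES COMMODITY POSITIONS</FONT></P>';
run;

data want;
   set have;
   length clean $200;
   /* same pattern as above; 100 is the maximum number of replacements */
   clean = prxchange('s/<[^>]*>//', 100, text);
   put clean=;
run;
```

Here the pattern is passed to PRXCHANGE as a constant string instead of an id returned by PRXPARSE. For this value the log should show clean=DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY POSITIONS.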
Just to add to data_null_'s answer:
1) s/<[^>]*>// can also be written as s/<.*?>//. On single-line input such as this, the two expressions give the same result.
The "*?" is the lazy repetition operator, which stops as soon as possible, whereas the normal "*" repetition operator is greedy and matches as much as possible. Differences between greedy and lazy matching only show up when there are unbalanced delimiters. E.g., for the string
<aaa>bbb>
s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.
2) In prxchange( rx, -1, string ), the "-1" applies rx repeatedly until the end of the string, however many matches there happen to be.
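The two points above can be checked with a small test step. This is just a sketch; the variable names s, greedy, lazy and negated are made up for illustration.

```sas
data _null_;
   length s greedy lazy negated $20;
   s = '<aaa>bbb>';
   /* greedy: ".*" runs to the last ">", so the whole string is removed */
   greedy  = prxchange('s/<.*>//',    -1, s);
   /* lazy: ".*?" stops at the first ">", so only <aaa> is removed */
   lazy    = prxchange('s/<.*?>//',   -1, s);
   /* negated class: cannot cross a ">", so again only <aaa> is removed */
   negated = prxchange('s/<[^>]*>//', -1, s);
   put greedy= lazy= negated=;
run;
```

greedy should come back blank, while lazy and negated should both contain bbb>. The -1 makes no difference for this string, since there is only one match, but it removes the need to guess an upper limit such as 100.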
The remaining string, I believe, is the "&NBSP;", which is the HTML entity for a non-breaking space rather than a tag. You could modify the regular expression to remove this entity as well:
s/<[^>]+>|\x26[^<>]*;//
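A short sketch of how that extended pattern might be used, with a shortened sample string and made-up variable names s and c:

```sas
data _null_;
   length s c $200;
   s = '<P><FONT SIZE="2">&NBSP;</FONT></P> <P><B>DERIVATIVES: </B>THE COMPANY</P>';
   /* \x26 is the hex code for "&"; -1 replaces every match;       */
   /* strip() drops the blank left over between the two paragraphs */
   c = strip(prxchange('s/<[^>]+>|\x26[^<>]*;//', -1, s));
   put c=;
run;
```

This should leave c=DERIVATIVES: THE COMPANY. One caveat: \x26[^<>]*; is greedy, so it runs from an "&" to the last ";" before the next bracket. Ordinary text that contains an ampersand followed later by a semicolon would be removed too. If that matters for your data, a narrower alternative such as \x26#?\w+; might be safer, though I have not tried it on your data.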
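I wrote some code a while ago to pull strings out of HTML source, much like your situation.
But I do not know whether the code suits your case.
data want(where=(row not in (' ' '&NBSP;')));
   infile datalines dsd dlm='><';
   format row $200.;
   /* jump to just after the next '>' and read up to the next '<' or '>' */
   input @'>' row @@;
datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;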
Ksharp