Dear all,
I have the following problem. I have a string variable that contains plain text mixed with strings enclosed in angle brackets, e.g.:
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
Now I want to keep only the text that is not enclosed in brackets. Here, that would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
How can this be done?
Thanks,
Valentin
You could use a regular expression. Note that there is one other text string that is not enclosed in <>, so it also survives the substitution.
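filename FT15F001 temp lrecl=512;
data _null_;
   infile FT15F001;
   /* compile the pattern once: '<', any run of non-'>' characters, then '>' */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* apply the substitution up to 100 times to the current input line */
   c = prxchange(rx,100,_infile_);
   putlog c=;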
parmcards4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;
Just to add to data_null_'s answer,
1) s/<[^>]*>// can also be written as s/<.*?>//. For input like this, the two expressions give the same result (they differ only if a tag spans a line break, since "." does not match a newline by default).
The "*?" is the lazy repetition operator, which stops as soon as possible, whereas the normal "*" repetition operator is greedy and matches as much as possible. The difference between greedy and lazy shows up whenever more than one '>' follows an opening '<'. For example, given
<aaa>bbb>
s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.
2) In prxchange( rx, -1, string ), the "-1" applies the substitution repeatedly until the end of the string is reached, however many matches there are.
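The two points above can be checked outside SAS. Here is a minimal sketch using Python's `re` module rather than the SAS PRX functions (that swap is my assumption; both use Perl-style syntax for these particular patterns). The variable names are illustrative only. In `re.sub`, `count=0` (the default) replaces every match, which plays the role of the "-1" argument described above.

```python
import re

s = "<aaa>bbb>"

# Greedy: ".*" runs to the LAST '>', so the whole string is removed.
greedy = re.sub(r"<.*>", "", s)

# Lazy: ".*?" stops at the FIRST '>', leaving "bbb>".
lazy = re.sub(r"<.*?>", "", s)

# Negated class: "[^>]*" cannot cross a '>', so it also stops at the first one.
negated = re.sub(r"<[^>]*>", "", s)

# Replace only the first match versus every match.
two_tags = "<a>x<b>y"
first_only = re.sub(r"<[^>]*>", "", two_tags, count=1)
all_matches = re.sub(r"<[^>]*>", "", two_tags)  # count=0 means "all"

# Where lazy and negated-class differ: a tag that spans a line break.
# "." does not match a newline by default, but "[^>]" does.
multiline = "<a\nb>x"
lazy_multiline = re.sub(r"<.*?>", "", multiline)
negated_multiline = re.sub(r"<[^>]*>", "", multiline)

print(repr(greedy), repr(lazy), repr(negated))
print(repr(first_only), repr(all_matches))
print(repr(lazy_multiline), repr(negated_multiline))
```

The last pair is the one case where the lazy form and the negated character class give different results, so the negated class is the safer choice if a tag might contain a line break.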
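The remaining string, I believe, is "&NBSP;", the non-breaking space entity (a character entity rather than a tag). You could modify the regular expression to remove such entities as well:
s/<[^>]+>|\x26[^<>;]*;//
Here \x26 is the hex code for "&", and excluding ";" from the character class makes each entity match stop at its first semicolon.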
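To see the combined tag-and-entity pattern at work, here is a rough sketch in Python's `re` module (again a stand-in for the SAS PRX functions, not the poster's code). The pattern here excludes ";" from the entity character class so that a match stops at the first semicolon. The helper name `strip_markup`, the whitespace clean-up step, and the shortened sample string are all illustrative assumptions.

```python
import re

# A tag (<...>) or an entity (&...;). "&" is written literally here;
# the \x26 escape in the SAS version stands for the same character.
TAG_OR_ENTITY = re.compile(r"<[^>]+>|&[^<>;]*;")


def strip_markup(text):
    """Remove tags and entities, then collapse leftover runs of whitespace."""
    cleaned = TAG_OR_ENTITY.sub("", text)
    return " ".join(cleaned.split())


# Shortened version of the sample from the question (hypothetical abbreviation).
sample = (
    '<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">'
    '&NBSP;</FONT></P> <P><B><FONT SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: '
    '</FONT></B><FONT SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS'
)

result = strip_markup(sample)
print(result)

# Text containing an ordinary semicolon after an entity is left alone.
with_semicolon = strip_markup("&NBSP;PRICES; MARKETS")
print(with_semicolon)
```

The second call shows why the semicolon is excluded from the class: without that, the entity match would run on to the last semicolon before the next tag and swallow "PRICES;" as well.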
I wrote some code a while ago to pull text out of HTML source, much like your situation.
I do not know whether it suits your case, though.
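data want(where=(row not in (' ' '&NBSP;')));
infile datalines dsd dlm='><';
format row $200.;
input @'>' row @@;
datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;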
Ksharp