Valentin_HU
Calcite | Level 5

Dear all,

I have the following problem. I have a string variable whose values contain HTML markup, i.e. text enclosed in angle brackets, mixed with plain text. For example:

<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

Now I want to keep only the text that is not enclosed in angle brackets. Here that would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

How can this be done?

Thanks,

Valentin

4 REPLIES 4
data_null__
Jade | Level 19

You could use a regular expression. Note that the sample contains one other text string that is not enclosed in <>, so it will be kept as well.

filename FT15F001 temp lrecl=512;

data _null_;
   infile FT15F001;
   /* compile the pattern once: it matches a single <...> tag */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* remove up to 100 tags from the current input line */
   c = prxchange(rx,100,_infile_);
   putlog c=;
   parmcards4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
   run;
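Since the question is about a string variable rather than a raw file, here is a minimal sketch of the same substitution applied in a DATA step. The data set names HAVE and WANT, the variable names TEXT and CLEAN, and the shortened sample value are assumptions made for illustration only.

```sas
/* Hypothetical input: data set HAVE with a character variable TEXT */
data have;
   length text $512;
   text = '<B><FONT SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS';
run;

data want;
   set have;
   length clean $512;
   /* -1 replaces every match; a constant pattern is compiled only once */
   clean = prxchange('s/<[^>]*>//', -1, text);
run;
```

Passing the pattern as a string constant avoids the separate prxparse/retain step; both forms should give the same result.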

DLing
Obsidian | Level 7

Just to add to data_null_'s answer,

1) s/<[^>]*>// can also be expressed as s/<.*?>//. For this data the two expressions behave the same (strictly, "." does not match a newline by default, whereas [^>] does).

The "*?" is the lazy repetition operator, which matches as little as possible, whereas the normal "*" repetition operator is greedy and matches as much as possible. The difference shows up as soon as more than one '>' follows the '<', e.g.,

for <aaa>bbb>, s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.

2) prxchange( rx, -1, string ): a count of -1 applies the substitution to every match until the end of the string, however many there are.
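To see the greedy and lazy forms side by side, a small sketch such as the following can be run. The variable names are made up for illustration, and the test string is the one from the example above.

```sas
data _null_;
   length s greedy lazy negated $20;
   s = '<aaa>bbb>';
   /* greedy: .* runs to the last '>' in the string */
   greedy  = prxchange('s/<.*>//',    -1, s);
   /* lazy: .*? stops at the first '>' */
   lazy    = prxchange('s/<.*?>//',   -1, s);
   /* negated class: cannot run past a '>' at all */
   negated = prxchange('s/<[^>]*>//', -1, s);
   putlog greedy= lazy= negated=;
run;
```

The log should show GREEDY as blank and both LAZY and NEGATED as bbb>.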

FriedEgg
SAS Employee

The remaining string, I believe, is "&NBSP;", which is the HTML entity for a non-breaking space rather than a tag. You could modify the regular expression to remove this entity as well.

s/<[^>]+>|\x26[^<>]*;//
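As a rough sketch of how this pattern could be used, the step below applies it with a count of -1 and then tidies the leftover blanks. The variable names RAW and CLEAN and the shortened sample string are assumptions for illustration; compbl and strip are standard SAS functions.

```sas
data _null_;
   length raw clean $200;
   /* single quotes keep &NBSP; away from the macro processor */
   raw = '<P>&NBSP;</P> <B>DERIVATIVES AND HEDGING ACTIVITY: </B><FONT SIZE="2">THE COMPANY USES COMMODITY';
   /* remove tags, and anything from an ampersand up to a semicolon */
   clean = prxchange('s/<[^>]+>|\x26[^<>]*;//', -1, raw);
   /* collapse runs of blanks and trim both ends */
   clean = strip(compbl(clean));
   putlog clean=;
run;
```

One caveat: the entity part of the pattern matches from an ampersand up to the last semicolon before the next bracket, so ordinary text that contains both an "&" and a later ";" would also be removed. If that matters for your data, a narrower entity pattern may be safer.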

Ksharp
Super User

I wrote some code a while ago to pull the text out of HTML source, much like your situation.

I do not know whether it suits your case, though.

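data want(where=(row not in (' ' '&NBSP;')));
 infile datalines dsd  dlm='><' ;
 format row $200.;
 input @'>' row  @@;
datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;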







Ksharp
