Delete text between brackets

Occasional Contributor
Posts: 9

Dear all,

I have the following problem. I have a string variable that contains, among other things, pieces of text surrounded by angle brackets. E.g.:

<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

Now I want to keep only the text that is not surrounded by brackets. Here that would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

How can this be done?

Thanks,

Valentin

Respected Advisor
Posts: 3,799

Re: Delete text between brackets

Posted in reply to Valentin_HU

You could use a regular expression. Note that there is one other text string that is not enclosed in <>, so it will remain after the tags are removed.

filename FT15F001 temp lrecl=512;

data _null_;
   infile FT15F001;
   /* Compile the pattern once; it matches a single <...> tag */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* Replace up to 100 tags in the current input line with nothing */
   c = prxchange(rx,100,_infile_);
   putlog c=;
   parmcards4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
   run;
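Since the question mentions a string variable rather than a raw file, here is a minimal sketch of the same idea applied to a data set. The data set names HAVE and WANT and the variable names TEXT and CLEAN are made up for illustration, and the sample value is a shortened version of the string in the question.

```sas
/* Hypothetical input: data set HAVE with one character variable TEXT */
data have;
   length text $512;
   text = '<P><B><FONT SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT SIZE="2">THE COMPANY USES COMMODITY POSITIONS';
   output;
run;

data want;
   set have;
   length clean $512;
   /* A constant pattern string can be passed directly to PRXCHANGE; */
   /* -1 repeats the substitution until no further match is found    */
   clean = prxchange('s/<[^>]*>//', -1, text);
   /* Collapse runs of blanks left behind and left-align the result */
   clean = left(compbl(clean));
   putlog clean=;
run;
```

The COMPBL and LEFT calls are only a tidy-up step and can be dropped if the original spacing matters.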

Frequent Contributor
Posts: 104

Re: Delete text between brackets

Posted in reply to data_null__

Just to add to data_null_'s answer,

1) s/<[^>]*>// can also be expressed as s/<.*?>//. For single-line input like this, the two expressions are equivalent.

The "*?" is the lazy repetition operator, which stops as soon as possible, whereas the normal "*" repetition operator is greedy and will match as much as possible. The difference between s/<.*>// and s/<.*?>// shows up whenever more than one '>' follows the '<', e.g.,

<aaa>bbb>   s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.

2) In prxchange( rx, -1, string ), the "-1" applies the substitution repeatedly until the end of the string is reached, however many matches there are.
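To see both points in the log, a small sketch along these lines could be run. The variable names are arbitrary, and the test string is the <aaa>bbb> example from above.

```sas
data _null_;
   length s greedy lazy negated $20;
   s = '<aaa>bbb>';
   /* Greedy: .* runs to the last '>', so the whole string is removed */
   greedy  = prxchange('s/<.*>//',    -1, s);
   /* Lazy: .*? stops at the first '>', leaving bbb> */
   lazy    = prxchange('s/<.*?>//',   -1, s);
   /* Negated class: [^>]* cannot cross a '>', so it also leaves bbb> */
   negated = prxchange('s/<[^>]*>//', -1, s);
   putlog greedy= lazy= negated=;
run;
```

With -1 as the second argument, each call keeps substituting until no further match is found, so no fixed count such as 100 has to be guessed.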

Trusted Advisor
Posts: 1,301

Re: Delete text between brackets

Posted in reply to data_null__

The remaining string, I believe, is the "&NBSP;", which is the HTML entity for a non-breaking space (an entity rather than a tag). You could modify the regular expression to remove this entity as well.

s/<[^>]+>|\x26[^<>]*;//
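As a rough sketch, the combined pattern could be tried like this. The sample string is a shortened version of the one in the question, and the variable names S and C are arbitrary.

```sas
data _null_;
   length s c $200;
   s = '<P><FONT SIZE="2">&NBSP;</FONT></P> <B>DERIVATIVES AND HEDGING ACTIVITY: </B>THE COMPANY USES COMMODITY POSITIONS';
   /* \x26 is the ampersand; the second alternative removes &...; entities */
   c = prxchange('s/<[^>]+>|\x26[^<>]*;//', -1, s);
   /* Collapse runs of blanks left behind and left-align the result */
   c = left(compbl(c));
   putlog c=;
run;
```

One thing to watch: [^<>]* is greedy, so if ordinary text between two tags contains an ampersand followed later by a semicolon, everything in between is removed too. If that is a concern, a narrower entity pattern such as \x26#?\w+; might be safer, though that is a suggestion of mine and untested against your data.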

Super User
Posts: 10,046

Re: Delete text between brackets

Posted in reply to Valentin_HU

I wrote some code a while ago to pull text out of HTML source, much like your situation.

But I do not know whether the code suits your case.

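data want(where=(row not in (' ' '&NBSP;')));
 infile datalines dsd  dlm='><' ;
 format row $200.;
 input @'>' row  @@;
datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;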







Ksharp
