Re: Delete text between brackets

Valentin_HU · Posted 08-24-2011 04:37 AM

Dear all,

I have the following problem. I have a string variable in which parts of the text are enclosed in angle brackets, e.g.:

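<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS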

Now I want to keep only the text that is not surrounded by brackets. Here that would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

How can this be executed?

Thanks,

Valentin

data_null__ · Posted 08-24-2011 07:49 AM

You could use a regular expression. Note that there is one other text string that is not enclosed in <>.

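filename FT15F001 temp lrecl=512;   /* PARMCARDS writes its lines to this fileref */
data _null_;
   infile FT15F001;
   /* compile the pattern once: '<', any run of non-'>' characters, then '>' */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* replace up to 100 matches in the current input line */
   c = prxchange(rx,100,_infile_);
   putlog c=;
   parmcards4;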
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
   run;

DLing · Posted 08-24-2011 11:51 AM

Just to add to data_null_'s answer,

1) s/<[^>]*>// can also be written as s/<.*?>//. On a single line of input, the two expressions match the same text.

The "*?" is the lazy repetition operator, which stops as soon as possible, whereas the normal "*" repetition operator is greedy and matches as much as possible. The difference shows up as soon as a string contains more than one '>'.

For example, given <aaa>bbb>, s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.

2) In prxchange( rx, -1, string ), the "-1" repeats the substitution until the end of the string is reached, however many matches that happens to be.
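
Both points can be checked with a small DATA step. This is only a minimal sketch: the data set name demo and the variable names are made up for illustration, and the patterns are passed to prxchange directly as strings instead of going through prxparse.

```sas
data demo;
   length text greedy lazy negated all_tags $40;
   text = '<aaa>bbb>';
   /* Greedy: .* runs to the last '>', so the whole string is removed */
   greedy  = prxchange('s/<.*>//',    1, text);
   /* Lazy: .*? stops at the first '>', leaving bbb> */
   lazy    = prxchange('s/<.*?>//',   1, text);
   /* Negated class: same result as the lazy form on this input */
   negated = prxchange('s/<[^>]*>//', 1, text);
   /* times = -1: every match is replaced, leaving x y */
   all_tags = prxchange('s/<[^>]*>//', -1, '<b>x</b> <i>y</i>');
   put greedy= lazy= negated= all_tags=;
run;
```

In the log, greedy should come out blank, lazy and negated should both be bbb>, and all_tags should be x y.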

FriedEgg · Posted 08-24-2011 12:04 PM

The remaining string, I believe, is "&NBSP;", which is the HTML entity for a non-breaking space rather than a tag. You could modify the regular expression to remove it as well.

s/<[^>]+>|\x26[^<>]*;//
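
As a rough sketch of how that pattern could be used, the step below applies it with a times argument of -1 to a shortened version of the sample string. The data set name cleaned, the variable names, and the shortened string are assumptions for illustration only.

```sas
data cleaned;
   length raw clean $200;
   retain rx;
   /* Compile once: strip tags, or anything from '&' (\x26) up to a ';' */
   if _n_ = 1 then rx = prxparse('s/<[^>]+>|\x26[^<>]*;//');
   /* Single quotes keep the macro processor from reading &NBSP as a macro variable */
   raw = '<P><FONT SIZE="2">&NBSP;</FONT></P> <P><B>DERIVATIVES AND HEDGING ACTIVITY: </B>THE COMPANY USES COMMODITY AND CURRENCY POSITIONS';
   /* -1 removes every match; LEFT drops the blank left over between the two <P> blocks */
   clean = left(prxchange(rx, -1, raw));
   put clean=;
run;
```

One thing to watch: [^<>]* is greedy, so if the text between tags contains a later semicolon, everything from the '&' up to that semicolon is removed too. A tighter alternative such as \x26#?\w+; might be safer in that case, though that variation is mine and untested against the real data.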

Ksharp · Posted 08-25-2011 02:02 AM

I wrote some code a while ago to pull text out of HTML source, much like your situation.

I am not sure whether it suits your case, though.

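/* keep only pieces that are neither blank nor the &NBSP; entity */
data want(where=(row not in (' ' '&NBSP;')));
 /* treat both angle brackets as delimiters */
 infile datalines dsd  dlm='><' ;
 format row $200.;
 /* jump past the next '>', read up to the next delimiter, hold the line */
 input @'>' row  @@;
datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;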

Ksharp
