Valentin_HU
Calcite | Level 5

Dear all,

I have the following problem. I have a string variable whose value partly consists of text enclosed in angle brackets (HTML tags). For example:

<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

Now I want to keep only the text that is not enclosed in angle brackets. Here, that would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

How can this be done?

Thanks,

Valentin

4 REPLIES 4
data_null__
Jade | Level 19

You could use a regular expression. Note that there is one other text string that is not enclosed in <>.

filename FT15F001 temp lrecl=512;

data _null_;
   infile FT15F001;
   /* compile the substitution pattern once, on the first iteration */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* remove up to 100 <...> matches from the current input line */
   c = prxchange(rx,100,_infile_);
   putlog c=;
   parmcards4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
   run;

DLing
Obsidian | Level 7

Just to add to data_null_'s answer,

1) s/<[^>]*>// can also be expressed as s/<.*?>//. For input like this, the two expressions are equivalent.

The "*?" is the lazy repetition operator, which stops as soon as possible, whereas the normal "*" repetition operator is greedy and matches as much as possible. The difference shows up whenever more than one '>' follows the '<', for example with unbalanced delimiters:

<aaa>bbb>   s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.

2) prxchange( rx, -1, string )    With "-1", the replacement is applied to every match until the end of the string, however many there are.
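
To make the two points above concrete, here is a minimal sketch that runs the greedy, lazy, and negated-class patterns against the unbalanced example string. The variable names (s, greedy, lazy, negated) are only illustrative, and the pattern is passed to prxchange directly as a string instead of going through prxparse.

```sas
data _null_;
   length s greedy lazy negated $40;
   s = '<aaa>bbb>';

   /* greedy .* extends to the LAST '>' in the string */
   greedy  = prxchange('s/<.*>//',    -1, s);

   /* lazy .*? stops at the FIRST '>' after the '<' */
   lazy    = prxchange('s/<.*?>//',   -1, s);

   /* negated class [^>]* also cannot run past the first '>' */
   negated = prxchange('s/<[^>]*>//', -1, s);

   /* -1 in each call means "replace every match", not just the first */
   put greedy= lazy= negated=;
run;
```

If the description above holds, greedy should come out blank, while lazy and negated should both keep bbb>.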

FriedEgg
SAS Employee

The remaining string, I believe, is the "&NBSP;", which is the HTML non-breaking space entity rather than a tag.  You could modify the regular expression to remove this entity as well.

s/<[^>]+>|\x26[^<>]*;//
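
As a rough sketch of how this combined pattern could be applied, the step below runs it with -1 so that every tag and entity is removed in one call. The test string is a shortened, hypothetical version of the original HTML, and raw and clean are illustrative variable names.

```sas
data _null_;
   length raw clean $300;

   /* shortened sample; single quotes keep SAS from treating &NBSP as a macro variable */
   raw = '<P><FONT SIZE="2">&NBSP;</FONT></P> <P><B>DERIVATIVES AND HEDGING ACTIVITY: </B><FONT SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS';

   /* first alternative removes <...> tags, second removes &...; entities (\x26 is '&') */
   clean = prxchange('s/<[^>]+>|\x26[^<>]*;//', -1, raw);

   /* drop the blank left over between the two <P> blocks */
   clean = strip(clean);

   put clean=;
run;
```

One caveat: [^<>]*; is greedy, so inside a single text segment it runs from the '&' to the last ';' before the next tag. If the text can contain a literal '&' followed later by a ';', a tighter entity pattern such as \x26#?\w+; may be safer. That variant is an assumption, not something proposed in this thread.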

Ksharp
Super User

I wrote some code a while ago to pull text strings out of HTML source, much like your situation.

I do not know whether it suits your case, though.

data want(where=(row not in (' ' '&NBSP;')));
 infile datalines dsd  dlm='><' ;
 format row $200.;
 input @'>' row  @@;
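datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;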

Ksharp

