Valentin_HU
Calcite | Level 5

Dear all,

I have the following problem: I have a string variable in which parts of the text are enclosed in angle brackets (HTML tags). For example:

<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

Now I want to keep only the text that is not enclosed in brackets. Here that would be: DERIVATIVES AND HEDGING ACTIVITY: THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS

How can this be done?

Thanks,

Valentin

4 REPLIES
data_null__
Jade | Level 19

You could use a regular expression. Note that there is one other text string that is not enclosed in <>, so it will remain in the result.

filename FT15F001 temp lrecl=512;

data _null_;
   infile FT15F001;
   /* compile the pattern once; it matches a single <...> tag */
   if _n_ eq 1 then rx = prxparse('s/<[^>]*>//');
   retain rx;
   input;
   length c $512;
   /* remove up to 100 tags from the current input line */
   c = prxchange(rx,100,_infile_);
   putlog c=;
   parmcards4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
   run;

DLing
Obsidian | Level 7

Just to add to data_null_'s answer,

1) s/<[^>]*>// can also be written as s/<.*?>//   For this input the two expressions give the same result.

The "*?" is the lazy repetition operator, which stops as soon as possible, whereas the normal "*" repetition operator is greedy and matches as much as possible.  The difference shows up whenever more than one '>' follows the '<', e.g.,

<aaa>bbb>   s/<.*>// would eliminate the whole string, whereas s/<.*?>// would eliminate only up to the first '>' and leave bbb> untouched.

2) prxchange( rx, -1, string )    A times argument of -1 repeats the substitution until the end of the string is reached, however many matches there are.
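
To make the greedy/lazy distinction concrete, here is a minimal sketch that runs the three patterns discussed above against the string <aaa>bbb>. The variable names (s, greedy, lazy, negated) are made up for illustration; the pattern is passed to prxchange directly as a string instead of going through prxparse.

```sas
data _null_;
   /* s, greedy, lazy and negated are illustrative names only */
   length s greedy lazy negated $20;
   s = '<aaa>bbb>';

   /* a times argument of -1 repeats the substitution until no match is left */
   greedy  = prxchange('s/<.*>//',    -1, s);   /* greedy: runs to the last '>'   */
   lazy    = prxchange('s/<.*?>//',   -1, s);   /* lazy: stops at the first '>'   */
   negated = prxchange('s/<[^>]*>//', -1, s);   /* negated class: same stop point */

   putlog greedy= lazy= negated=;
run;
```

Going by the description above, the log should show greedy as blank and both lazy and negated as bbb>.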

FriedEgg
SAS Employee

The remaining string is, I believe, "&NBSP;", which is the HTML entity for a non-breaking space (an entity rather than a tag).  You could modify the regular expression to remove it as well.

s/<[^>]+>|\x26[^<>]*;//
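
Since the original question is about a string variable in a data set rather than a raw file, here is a rough sketch of how this pattern might be applied in that setting. The data set names (have, want), the variable names (text, clean), the $400 lengths and the shortened sample value are all assumptions for illustration, not taken from the original post.

```sas
/* hypothetical input: one character variable holding HTML-like text */
data have;
   length text $400;
   text = '<P><FONT SIZE="2">&NBSP;</FONT></P> <P><B>DERIVATIVES AND HEDGING ACTIVITY: </B>THE COMPANY USES COMMODITY AND CURRENCY POSITIONS';
run;

data want;
   set have;
   length clean $400;
   retain rx;
   /* compile once: a <...> tag, or an entity starting with & (\x26) and ending in ; */
   if _n_ = 1 then rx = prxparse('s/<[^>]+>|\x26[^<>]*;//');
   /* -1 removes every match; strip() drops the leading blank left between paragraphs */
   clean = strip(prxchange(rx, -1, text));
   putlog clean=;
run;
```

One thing to keep in mind: the entity branch \x26[^<>]*; is greedy, so in text that contains an ordinary ampersand followed later by a semicolon (with no < or > in between) it would remove everything from the ampersand through that semicolon. If that can occur in your data, a narrower entity pattern may be safer.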

Ksharp
Super User

I wrote some code a while ago to pull text out of HTML source, which is much like your situation.

I do not know whether it suits your case, though.

data want(where=(row not in (' ' '&NBSP;')));
 infile datalines dsd  dlm='><' ;
 format row $200.;
 input @'>' row  @@;
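datalines4;
<P STYLE="MARGIN: 0IN 0IN 0PT"><FONT STYLE="FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">&NBSP;</FONT></P> <P STYLE="MARGIN: 0IN 0IN 0PT"><B><FONT STYLE="FONT-WEIGHT: BOLD; FONT-SIZE: 10PT; FONT-FAMILY: TIMES NEW ROMAN" SIZE="2">DERIVATIVES AND HEDGING ACTIVITY: </FONT></B><FONT STYLE="FONT-SIZE: 10PT" SIZE="2">THE COMPANY USES COMMODITY AND CURRENCY POSITIONS TO MANAGE ITS EXPOSURE TO PRICE FLUCTUATIONS IN THOSE MARKETS
;;;;
run;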







Ksharp


Discussion stats
  • 4 replies
  • 3368 views
  • 2 likes
  • 5 in conversation