Solved: find HTML code and remove them

Alexxxxxxx · Posted 03-14-2019 01:29 PM

Dear all,

How can I find all HTML tags (such as '<BR>', '<FONT>', '<BODY>') in a variable and remove them?

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
JUICE<BR>apple<footer> 
juice <BR> apple 
juice<BODY>apple 
juice<BODY> apple 
<BR>juice apple
<figure> juice 
;
run;

Could you please give me some suggestions about this?

Thanks in advance.

PeterClemmensen · Posted 03-14-2019 01:41 PM

Something like this?

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
JUICE<BR>apple<footer> 
juice <BR> apple 
juice<BODY>apple 
juice<BODY> apple 
<BR>juice apple
<figure> juice 
;
run;

data want;
   set have;
   new=prxchange('s/<\w*>//', -1, name);
run;
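
Note that \w matches only letters, digits, and underscores, so <\w*> would leave closing tags such as </FONT>, or tags that carry attributes, in place. A minimal sketch, assuming such tags should be removed as well; the sample rows and the names demo, text, and cleaned are made up for illustration:

```sas
data demo;
   length text cleaned $100;
   infile datalines truncover;
   input text $100.;
   /* <[^>]*> matches "<", then any characters other than ">", then ">" */
   cleaned = prxchange('s/<[^>]*>//', -1, text);
   datalines;
juice<FONT color=red>apple</FONT>
juice</BR>apple
;
run;
```

This pattern is deliberately broad: it would also remove any other text that happens to sit between "<" and ">", so it only suits data where angle brackets are used for tags alone.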

The DATA to DATA Step Macro
Blog: SASnrd

Alexxxxxxx · Posted 03-14-2019 02:23 PM

Dear draycut,

I appreciate your reply and kind advice.

May I ask one more question, please? How can I find the HTML code?

Thanks for your attention to this matter.

PeterClemmensen · Posted 03-14-2019 05:32 PM

@Alexxxxxxx, when you say HTML code, do you mean only the text inside the <>, or the text including the <>?

Also, what do you want to do with it? Put it in a separate variable, or something else?

Alexxxxxxx · Posted 03-14-2019 10:03 PM

@PeterClemmensen, I mean both the text and the <>. I expect to find them and put them in a separate variable. Could you please give me some suggestions about this?

PeterClemmensen · Posted 03-15-2019 03:11 AM

Here is a PRXNEXT example. I have written two different programs. The first outputs one observation for each HTML tag found. The second concatenates the tags found into a single variable, so the output has the same number of observations as the input data.

@Alexxxxxxx Let me know if it works for you 🙂

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
JUICE<BR>apple<footer> 
juice <BR> apple 
juice<BODY>apple 
juice<BODY> apple 
<BR>juice apple
<figure> juice 
;
run;

data want1;
   set have;
   RegExID = prxparse('/<\w*>/');
   start=1;
   call prxnext(RegExID, start, length(name), name, pos, length);
      do while (pos > 0);
         html = substr(name, pos, length);
         newname=prxchange('s/<\w*>//', -1, name);
         output;
         call prxnext(RegExID, start, length(name), name, pos, length);
      end;
   keep name html newname;
run;

data want2;
   set have;
   length html $200;
   RegExID = prxparse('/<\w*>/');
   start=1;
   html="";
   /* strip the tags once per row, so rows without any tag still get a value */
   newname=prxchange('s/<\w*>//', -1, name);
   call prxnext(RegExID, start, length(name), name, pos, length);
      do while (pos > 0);
         /* append each tag found to a comma-separated list */
         html = catx(',', html, substr(name, pos, length));
         call prxnext(RegExID, start, length(name), name, pos, length);
      end;
   keep name html newname;
run;

Alexxxxxxx · Posted 03-18-2019 03:57 AM

Dear draycut,

For the value 'JUICE<BR>apple<footer>', using the first program,

data want1;
   set have;
   RegExID = prxparse('/<\w*>/');
   start=1;
   call prxnext(RegExID, start, length(name), name, pos, length);
      do while (pos > 0);
         html = substr(name, pos, length);
         newname=prxchange('s/<\w*>//', -1, name);
         output;
         call prxnext(RegExID, start, length(name), name, pos, length);
      end;
   keep name html newname;
run;

I get

name	html	newname
JUICE<BR>apple<footer>	<BR>	JUICEapple
JUICE<BR>apple<footer>	<footer>	JUICEapple

However, I would like a blank between 'JUICE' and 'apple':

name	html	newname
JUICE<BR>apple<footer>	<BR>	JUICE apple
JUICE<BR>apple<footer>	<footer>	JUICE apple

Could you please give me some suggestions about this?

andreas_lds · Posted 03-14-2019 05:24 PM

Have you tried the code posted by @PeterClemmensen?

Ksharp · Posted 03-15-2019 10:41 AM

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
JUICE<BR>apple<footer> 
juice <BR> apple 
juice<BODY>apple 
juice<BODY> apple 
<BR>juice apple
<figure> juice 
;
run;

data want;
   set have;
   new=prxchange('s/<.*?>/ /', -1, name);
run;
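
Replacing each tag with a blank, as above, puts a blank between 'JUICE' and 'apple', but it can also leave doubled or leading blanks in values such as 'juice <BR> apple' or '<BR>juice apple'. A minimal sketch of one way to tidy that up, assuming single blanks between words are wanted; the names want_clean and new_clean are made up for illustration:

```sas
data want_clean;
   set have;
   length new_clean $100;
   /* replace every <...> tag with a single blank */
   new_clean = prxchange('s/<.*?>/ /', -1, name);
   /* COMPBL collapses runs of blanks into one; STRIP drops leading and trailing blanks */
   new_clean = strip(compbl(new_clean));
run;
```

With this, 'JUICE<BR>apple<footer>' should come out as 'JUICE apple'.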
