DATA Step, Macro, Functions and more

How to Extract information between line(s) from HTML

Contributor
Posts: 24

Hello everybody,

 

I am trying to extract information from a website. As I am very new to SAS, I don't know how to get the information or paragraph between lines. I've been thinking about this for several days already T.T. Please help!

 

Question 1: get the usertitle information between <span class="usertitle"> and </span>, and flag that information as missing when there is nothing between <span class="usertitle"> and <span style="font-weight: .

 

<span class="usertitle">
Member
</span>

......

<span class="usertitle">
Junior Member
</span>

......

<span class="usertitle">
<span style="font-weight: bold; color: black;">Not your guy, fwiend...</span>
</span>

 

Question 2: extract the reply content between <blockquote class="postcontent restore "> and </blockquote>, and remove the <br /> tags from the output.

 

<blockquote class="postcontent restore ">
replied content line 1 -- omit details here for brevity<br />
<br />
replied content line 2 -- omit details here for brevity.<br />
<br />
replied content line 3 -- omit details here for brevity.
</blockquote>

......

 

Thank you very much in advance!

 


Accepted Solutions
Solution
‎12-26-2016 07:22 PM
Valued Guide
Posts: 797

Re: How to Extract information between line(s) from HTML

You could read line-by-line and look for '<span class="usertitle">', or you could use the INPUT statement to do it for you as in

 

input @ '<span class="usertitle">' / _line_ :&$200. ;

 

  • The @ '<span class="usertitle">' tells SAS to search for the specified string, continuing over several lines if necessary.
  • The '/' means skip to the next line.
  • The remainder reads a character variable named _LINE_ of up to 200 characters. The '&' modifier means single interior blanks do not end the value (two consecutive blanks or the end of the line do), so you get "Junior Member" instead of just "Junior".
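The effect of the '&' modifier can be seen in a small sketch. The data set name demo and the variables plain and kept are made up for illustration; the same value is read once without '&' and once with it:

```sas
data demo;
  /* plain: ordinary list input, which stops at the first blank       */
  /* '/'  : move to the next line                                     */
  /* kept : the & modifier lets single interior blanks stay in value  */
  input plain :$20. / kept & $20. ;
datalines;
Junior Member
Junior Member
;
run;

proc print data=demo;
run;
```

Here plain should hold only the first word, while kept holds the whole title.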

Then all you have to do is check the contents of the _LINE_ variable for the unwanted markup and assign USERTITLE accordingly, as below. Note that the "=:" operator is different from the ordinary "=": it compares only the first X characters, where X is the length of the shorter character value.

 

Also, you'll be reading from an external file, so use the INFILE statement to point the INPUT operation to the right source.

 

data want;
  length usertitle $20 ;
  /* find the opening tag, then read the following line into _line_ */
  input @ '<span class="usertitle">' / _line_ :&$200. ;
  /* leave usertitle missing when the line starts with nested markup */
  if not (_line_ =: '<span style="font-weight:') then usertitle=_line_;
  drop _line_;
datalines4;
<span class="usertitle">
Member
</span>
<span class="usertitle">
Junior Member
</span>
<span class="usertitle">
<span style="font-weight: bold; color: black;"> Not your guy, fwiend...</span>
</span>
;;;;
run;

 

That solves your first request.  And you can use the same tools to begin solving the second.
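
As a starting point for the second request, here is a rough line-by-line sketch rather than a finished solution. It assumes the opening and closing blockquote tags each sit on their own line, as in the sample, and that one reply fits in 2000 characters. The names replies, line, content and inside are made up for illustration:

```sas
data replies;
  infile datalines4 truncover;
  length line $500 content $2000;
  retain content ' ' inside 0;
  input line $char500.;

  if index(line, '<blockquote class="postcontent restore ">') then do;
    /* opening tag: start collecting a new reply */
    inside = 1;
    content = ' ';
  end;
  else if index(line, '</blockquote>') then do;
    /* closing tag: write out one observation per reply */
    if inside then output;
    inside = 0;
  end;
  else if inside then do;
    /* drop the <br /> markup and append any text that remains */
    line = tranwrd(line, '<br />', ' ');
    if not missing(line) then content = catx(' ', content, line);
  end;

  keep content;
datalines4;
<blockquote class="postcontent restore ">
replied content line 1<br />
<br />
replied content line 2<br />
</blockquote>
;;;;
run;
```

When reading from an external file, the INFILE statement would name that file instead of DATALINES4, and the in-stream data lines would be dropped.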


All Replies
Contributor
Posts: 24

Re: How to Extract information between line(s) from HTML

Hi mkeintz,

 

Thank you for the quick response. What if my data comes from a URL where there are 25 usertitles? The datalines4; statement does not seem to work for me.

Valued Guide
Posts: 797

Re: How to Extract information between line(s) from HTML

When SAS reads data from a series of lines directly following the DATA step program (rather than from an external file), the DATALINES statement is needed to tell SAS that the program code has ended and the data is about to start. I should have told you that when you read from an external file, the DATALINES statement is not needed (so you can also drop the line of 4 semicolons). The reason it's DATALINES4 rather than DATALINES is that otherwise SAS would take the first semicolon in the data to indicate end of data. DATALINES4 tells SAS that 4 consecutive semicolons are required to indicate end of data.
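
A minimal sketch of the external-file version, assuming the page can be reached through the FILENAME URL access method. The fileref src and the address are hypothetical placeholders, and LRECL=32767 is an assumption to allow for long HTML lines:

```sas
/* Hypothetical address: replace with the page you are reading */
filename src url "http://www.example.com/thread.html";

data want;
  /* INFILE points INPUT at the external source, so no DATALINES4   */
  /* statement, data lines, or closing ;;;; line is needed.         */
  /* The default FLOWOVER behavior is left in place, as before.     */
  infile src lrecl=32767;
  length usertitle $20;
  input @ '<span class="usertitle">' / _line_ :&$200. ;
  if not (_line_ =: '<span style="font-weight:') then usertitle = _line_;
  drop _line_;
run;
```

Each usertitle span found on the page should produce one observation, with USERTITLE left missing where the line after the tag starts with the nested span markup.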

 

 

Contributor
Posts: 24

Re: How to Extract information between line(s) from HTML

I adjusted a little bit. Works now!!!!!

 

Thank you very much 

☑ This topic is solved.
