may0423
Obsidian | Level 7

Hello everybody,

 

I am trying to extract information from a website. As I am very new to SAS, I don't know how to get the information/paragraph between lines. I've been thinking about this for several days already T.T . Please help!!!

 

Question 1: to get the usertitle information between <span class="usertitle"> and </span>, and also to specify that such information is missing when there is nothing between <span class="usertitle"> and <span style="font-weight: .

 

<span class="usertitle">
Member
</span>

......

<span class="usertitle">
Junior Member
</span>

......

<span class="usertitle">
<span style="font-weight: bold; color: black;">Not your guy, fwiend...</span>
</span>

 

Question 2: to extract replied contents between <blockquote class="postcontent restore "> and </blockquote>, and to delete <br /> in the output.

 

<blockquote class="postcontent restore ">
replied content line 1 -- omit details here for brevity<br />
<br />
replied content line 2 -- omit details here for brevity.<br />
<br />
replied content line 3 -- omit details here for brevity.
</blockquote>

......

 

Thank you very much in advance!

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You could read line-by-line and look for '<span class="usertitle">', or you could use the INPUT statement to do it for you as in

 

input @ '<span class="usertitle">' / _line_ :&$200. ;

 

  • The  @ '<span class="usertitle">' says to look for the specified string, even if the search has to go over several lines.
  • The '/' means skip to the next line.
  • The remainder says to read in a character variable named _LINE_ of up to 200 characters (and the '&' means don't stop before 200 characters if you encounter interior single blanks in the line - so you get "Junior Member" instead of just "Junior").
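
To see what the '&' modifier changes, here is a small sketch that reads the same data line twice, once with '&' and once without. The data set name AMP_DEMO and the variable names WITH_AMP and WITHOUT_AMP are made up for this illustration.

```sas
data amp_demo;
  /* ':&$20.' keeps reading across single blanks;              */
  /* '@1' moves the pointer back to column 1 of the same line; */
  /* ':$20.' stops at the first blank.                         */
  input with_amp :&$20. @1 without_amp :$20.;
  put with_amp= without_amp=;
datalines;
Junior Member
;
run;
```

The log line written by PUT should show the full "Junior Member" for WITH_AMP and only "Junior" for WITHOUT_AMP.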

Then all you have to do is check the contents of the _LINE_ variable for the unwanted markup, and assign USERTITLE accordingly, as in the DATA WANT step below. (Note that the "=:" relation is different from the ordinary "=".  It means compare only the first X characters, where X is the length of the shorter character value.)
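
A quick way to convince yourself of the difference between "=:" and "=" is a throwaway DATA _NULL_ step like the sketch below. The variable name LINE and the messages are only for illustration.

```sas
data _null_;
  length line $200;
  line = '<span style="font-weight: bold; color: black;">Not your guy, fwiend...</span>';

  /* "=:" compares only as many characters as the shorter operand has */
  if line =: '<span style="font-weight:' then put 'prefix match';
  else put 'no prefix match';

  /* the ordinary "=" compares the whole values */
  if line = '<span style="font-weight:' then put 'exact match';
  else put 'no exact match';
run;
```

The first test should succeed and the second should fail, because the full value of LINE is longer than the literal it is compared with.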

 

Also you'll be reading from an external file, so use the INFILE statement to point the INPUT operation to the right source.

 

data want;
  length usertitle $20 ;
  /* find the opening tag, skip to the next line, read up to 200 characters */
  input @ '<span class="usertitle">' / _line_ :&$200. ;
  /* keep the line only when it does not begin with the styling markup */
  if not (_line_ =: '<span style="font-weight:') then usertitle=_line_;
  drop _line_;
datalines4;
<span class="usertitle">
Member
</span>
<span class="usertitle">
Junior Member
</span>
<span class="usertitle">
<span style="font-weight: bold; color: black;"> Not your guy, fwiend...</span>
</span>
;;;;
run;

 

That solves your first request.  And you can use the same tools to begin solving the second.
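
For the second request, one possible starting point is the line-by-line alternative mentioned at the top of this reply: read every line, remember whether you are inside a blockquote, and glue the lines together. This is only a sketch under assumptions; the data set name REPLIES, the variables CONTENT and INPOST, and the lengths $1000 and $500 are invented here and would need adjusting to the real page.

```sas
data replies;
  infile datalines4 truncover;
  length content $1000 _line_ $500;
  retain content ' ' inpost 0;        /* carry the text and the flag across lines */
  input _line_ $500.;

  if _line_ =: '<blockquote class="postcontent' then do;
    inpost = 1;                       /* a new reply starts */
    content = ' ';
  end;
  else if _line_ =: '</blockquote>' then do;
    if inpost then output;            /* one observation per reply */
    inpost = 0;
  end;
  else if inpost then do;
    _line_ = tranwrd(_line_, '<br />', ' ');   /* blank out the line-break tags */
    if not missing(_line_) then content = catx(' ', content, _line_);
  end;

  keep content;
datalines4;
<blockquote class="postcontent restore ">
replied content line 1<br />
<br />
replied content line 2<br />
<br />
replied content line 3
</blockquote>
;;;;
run;
```

With the sample lines above this should give one observation whose CONTENT holds the three reply lines joined by single blanks. It assumes the opening and closing blockquote tags each sit at the start of their own line, as in the question.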


may0423
Obsidian | Level 7

Hi mkeintz,

 

Thank you for the quick response. What if my data comes from a URL where there are 25 usertitles? The datalines4; statement does not seem to work for me.

mkeintz
PROC Star

When SAS reads data from a series of lines directly following the data step program (rather than from an external file), the DATALINES statement is needed to tell SAS that the program code is ended and the data is about to start.  I should have told you that when you read from an external file, the DATALINES statement is not needed.   The reason it's DATALINES4 rather than DATALINES is that otherwise SAS will take the first semicolon in the data to indicate end-of-data.  DATALINES4 tells SAS that 4 consecutive semicolons are required to indicate end of data.  (So when you read from an external file, you can drop the line of 4 semicolons also.)
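
As a rough sketch of what the external-file version could look like when the page is read straight from the web: the fileref SRC and the URL below are placeholders (the real address is not given in this thread), and LRECL=32767 is just a generous guess for long HTML lines.

```sas
/* hypothetical address; replace with the real page */
filename src url 'http://www.example.com/forum/thread.html';

data want;
  infile src lrecl=32767;             /* INFILE replaces DATALINES4 and the ;;;; line */
  length usertitle $20;
  input @ '<span class="usertitle">' / _line_ :&$200.;
  if not (_line_ =: '<span style="font-weight:') then usertitle = _line_;
  drop _line_;
run;
```

Each pass through the DATA step looks for the next usertitle span, so a page with 25 of them should produce 25 observations. The step ends by itself when no further match is found before the end of the file.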

 

 

may0423
Obsidian | Level 7

I adjusted a little bit. Works now!!!!!

 

Thank you very much 
