Hello everybody,
I am trying to extract information from a website. As I am very new to SAS, I don't know how to get the information/paragraph between lines. I've been thinking about this for several days already T.T . Please help!!!
Question 1: to get the usertitle information between <span class="usertitle"> and </span>, and also to specify that such information is missing when there is nothing between <span class="usertitle"> and <span style="font-weight: ).
<span class="usertitle">
Member
</span>
......
<span class="usertitle">
Junior Member
</span>
......
<span class="usertitle">
<span style="font-weight: bold; color: black;">Not your guy, fwiend...</span>
</span>
Question 2: to extract replied contents between <blockquote class="postcontent restore "> and </blockquote>, and to delete <br /> in the output.
<blockquote class="postcontent restore ">
replied content line 1 -- omit details here for brevity<br />
<br />
replied content line 2 -- omit details here for brevity.<br />
<br />
replied content line 3 -- omit details here for brevity.
</blockquote>
......
Thank you very much in advance!
You could read line-by-line and look for '<span class="usertitle">', or you could let the INPUT statement do the search for you, as in
input @ '<span class="usertitle">' / _line_ :&$200. ;
Then all you have to do is check the contents of the _LINE_ variable for the unwanted markup and assign USERTITLE accordingly, as below. (Note that the "=:" operator is different from the ordinary "=": it compares only the first X characters, where X is the length of the shorter character value.)
Also, you'll be reading from an external file, so use the INFILE statement to point the INPUT operation to the right source.
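To see the "=:" comparison on its own, here is a small sketch. The variable names (a, b, prefix, starts_a, starts_b) are made up for illustration and are not part of the solution.

```sas
data _null_;
  a = '<span style="font-weight: bold;">text</span>';
  b = 'Member';
  prefix = '<span style="font-weight:';
  /* =: truncates the longer operand to the length of the shorter one */
  starts_a = (a =: prefix);   /* 1: a begins with the prefix */
  starts_b = (b =: prefix);   /* 0: 'Member' does not match '<span ' */
  put starts_a= starts_b=;    /* writes starts_a=1 starts_b=0 to the log */
run;
```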
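data want;
length usertitle $20 ;
/* find the usertitle tag, move to the next line, and read that line into _line_ */
input @ '<span class="usertitle">' / _line_ :&$200. ;
/* keep the line as the title unless it starts with nested markup; otherwise usertitle stays missing */
if not (_line_ =: '<span style="font-weight:') then usertitle=_line_;
drop _line_;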
datalines4;
<span class="usertitle">
Member
</span>
<span class="usertitle">
Junior Member
</span>
<span class="usertitle">
<span style="font-weight: bold; color: black;"> Not your guy, fwiend...</span>
</span>
;;;;
run;
That solves your first request. And you can use the same tools to begin solving the second.
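For the second request, one possible sketch, assuming the opening and closing blockquote tags each sit on their own line as in your sample. The names replies, post_id, inside and text are hypothetical. It reads each line into _INFILE_, tracks whether it is inside a blockquote, and uses TRANWRD to blank out <br />.

```sas
data replies;
  length text $200;
  retain inside 0;                 /* 1 while between the opening and closing tags */
  input;                           /* load the whole line into _INFILE_ */
  if index(_infile_, '<blockquote class="postcontent restore ">') > 0 then do;
    post_id + 1;                   /* sum statement: counts the replies */
    inside = 1;
    delete;                        /* the tag line itself is not content */
  end;
  if index(_infile_, '</blockquote>') > 0 then do;
    inside = 0;
    delete;
  end;
  if inside then do;
    /* replace each <br /> with a blank, then trim the result */
    text = strip(tranwrd(_infile_, '<br />', ' '));
    if text ne '' then output;     /* skip lines that held only <br /> */
  end;
  keep post_id text;
datalines4;
<blockquote class="postcontent restore ">
replied content line 1 -- omit details here for brevity<br />
<br />
replied content line 2 -- omit details here for brevity.<br />
<br />
replied content line 3 -- omit details here for brevity.
</blockquote>
;;;;
run;
```

With the sample above this should give one row per non-empty content line, all with the same post_id. If you want one row per reply instead, you would need to concatenate the lines before writing the row.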
Hi mkeintz,
Thank you for the quick response. What if my data is from a URL where there are 25 usertitles? The datalines4; statement does not seem to work for me.
When SAS reads data from a series of lines directly following the data step program (rather than from an external file), the DATALINES statement is needed to tell SAS that the program code has ended and the data is about to start. I should have told you that when you read from an external file, the DATALINES statement is not needed. The reason it's DATALINES4 rather than DATALINES is that otherwise SAS will take the first semicolon in the data to indicate end-of-data. DATALINES4 tells SAS that 4 consecutive semicolons are required to indicate end of data. (So when you read from an external file, you can drop the line of 4 semicolons as well.)
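For the page with 25 usertitles, a minimal sketch, assuming the page is laid out like your sample (the title on the line right after the tag). The fileref src and the URL are placeholders, not your real address. I left out TRUNCOVER/MISSOVER on purpose: as far as I know, the @'...' search relies on the default FLOWOVER behaviour to keep looking on later lines.

```sas
/* placeholder URL: replace with the page you are reading */
filename src url 'http://www.example.com/thread.html';

data want;
  infile src lrecl=32767;          /* allow long HTML lines */
  length usertitle $20;
  /* find the tag, move to the next line, read it into _line_ */
  input @'<span class="usertitle">' / _line_ :&$200.;
  /* nested markup means there is no plain title: leave usertitle missing */
  if not (_line_ =: '<span style="font-weight:') then usertitle = _line_;
  drop _line_;
run;
```

The data step stops by itself when the search runs past the last usertitle and reaches the end of the page, so no DATALINES or semicolon line is involved.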