Hello everybody,
I am trying to extract information from a website. As I am very new to SAS, I don't know how to get the information/paragraph between lines. I've been thinking about this for several days already T.T . Please help!!!
Question 1: to get the usertitle information between <span class="usertitle"> and </span>, and also to specify that such information is missing when there is nothing between <span class="usertitle"> and <span style="font-weight: ).
<span class="usertitle">
Member
</span>
......
<span class="usertitle">
Junior Member
</span>
......
<span class="usertitle">
<span style="font-weight: bold; color: black;">Not your guy, fwiend...</span>
</span>
Question 2: to extract replied contents between <blockquote class="postcontent restore "> and </blockquote>, and to delete <br /> in the output.
<blockquote class="postcontent restore ">
replied content line 1 -- omit details here for brevity<br />
<br />
replied content line 2 -- omit details here for brevity.<br />
<br />
replied content line 3 -- omit details here for brevity.
</blockquote>
......
Thank you very much in advance!
You could read line-by-line and look for '<span class="usertitle">', or you could use the INPUT statement to do it for you as in
input @ '<span class="usertitle">' / _line_ :&$200. ;
Then all you have to do is check the contents of the _LINE_ variable for the unwanted markup and assign USERTITLE accordingly, as in the data step below. Note that the "=:" operator is different from the ordinary "=": it compares only the first X characters, where X is the length of the shorter character value.
Also you'll be reading from an external file, so use the INFILE statement to point the INPUT operation to the right source.
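data want;
  length usertitle $20 ;
  /* find the opening tag, then read the following line into _line_ */
  input @ '<span class="usertitle">' / _line_ :&$200. ;
  /* leave usertitle missing when that line starts with nested markup */
  if not (_line_ =: '<span style="font-weight:') then usertitle=_line_;
  drop _line_;
datalines4;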
<span class="usertitle">
Member
</span>
<span class="usertitle">
Junior Member
</span>
<span class="usertitle">
<span style="font-weight: bold; color: black;"> Not your guy, fwiend...</span>
</span>
;;;;
run;
That solves your first request. And you can use the same tools to begin solving the second.
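For the second request, here is a rough sketch rather than a finished solution. It swaps the @'string' pointer for a plain line-by-line read with a retained flag, which is a different technique from the step above. The names REPLIES, INSIDE, POST_ID and TEXT are made up for illustration, and the sample lines are shortened versions of the ones in the question. It assumes the opening and closing blockquote tags each sit on their own line, as in the sample.

```sas
data replies;
  infile datalines4 truncover;
  length text $200;
  retain inside 0;                 /* 1 while we are between the two tags */
  input line $char200.;
  if index(line, '<blockquote class="postcontent restore ">') then do;
    inside = 1;
    post_id + 1;                   /* sum statement: numbers each reply */
  end;
  else if index(line, '</blockquote>') then inside = 0;
  else if inside then do;
    /* swap every <br /> for a blank, then trim the result */
    text = strip(tranwrd(line, '<br />', ' '));
    if text ne ' ' then output;    /* skip lines that held only <br /> */
  end;
  keep post_id text;
datalines4;
<blockquote class="postcontent restore ">
replied content line 1<br />
<br />
replied content line 2<br />
<br />
replied content line 3
</blockquote>
;;;;
run;
```

This writes one observation per non-empty content line, with POST_ID telling you which blockquote it came from. If you would rather have one observation per reply, you could concatenate the lines into a longer retained variable and output at the closing tag instead.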
Hi mkeintz,
Thank you for the quick response. What if my data is from a URL where there are 25 usertitles? The datalines4; statement doesn't seem to work for me.
When SAS reads data from a series of lines directly following the data step program (rather than from an external file), the DATALINES statement is needed to tell SAS that the program code has ended and the data is about to start. I should have told you that when you read from an external file, the DATALINES statement is not needed. The reason it's DATALINES4 rather than DATALINES is that otherwise SAS would take the first semicolon in the data to indicate end-of-data. DATALINES4 tells SAS that 4 consecutive semicolons are required to indicate end of data. (So when you read from an external file you can drop the line of 4 semicolons as well.)
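In case it helps, here is a minimal sketch of the same data step pointed at a web page instead of in-stream data. The fileref name THREAD and the address are placeholders I made up, so substitute your own page. It assumes your SAS session is allowed to reach the site through the FILENAME URL access method, and that each usertitle sits on the line after its opening tag, as in your sample.

```sas
/* hypothetical address: replace with the page you are reading */
filename thread url 'http://www.example.com/showthread.php';

data want;
  /* keep the default FLOWOVER so that @'string' can search across lines; */
  /* LRECL is raised because HTML lines can be long                       */
  infile thread lrecl=32767;
  length usertitle $20 ;
  input @ '<span class="usertitle">' / _line_ :&$200. ;
  if not (_line_ =: '<span style="font-weight:') then usertitle=_line_;
  drop _line_;
run;
```

With no DATALINES4 statement and no line of semicolons, the step simply stops when INPUT runs out of lines on the page, so a page with 25 usertitles should give 25 observations.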