DATA Step, Macro, Functions and more

webscraping links from a web page

Super Contributor
Posts: 441

Hi,

 

I am reading Rick Wicklin's SAS blogs, and what I would like to do is scrape the full list of his posts; otherwise there are too many pages (currently 79), and it is tedious to go through each page to see which posts have been published.

 

The approach I took is to go to a given page and scrape the lines that end with .html (this is how the links end) and are enclosed in quotation marks. Here is the code I tried:

 

filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/2";
data blog_links(keep=link);
   length link $200;
   infile rick length=len lrecl=32767;
   input line $varying32767. len;
   if find(line,".html") then do;
      link=scan(line,2,'"');
      output;
   end;
run;

But I didn't get any links at all. Could you please help me?

 

Thank you! 



All Replies
Frequent Contributor
Posts: 136

Re: webscraping links from a web page

Not sure I follow the logic behind

link=scan(line,2,'"');

Do any or all links on those pages begin at position 2 and finish with double quotes?
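(As a side note on SCAN: its second argument counts delimited words, not character positions. A minimal sketch on a hypothetical anchor tag, with a made-up example.com URL:)

```sas
data _null_;
   line = '<a href="http://example.com/post.html" title="Example">';
   /* with '"' as the delimiter, the tokens are:
      1: <a href=   2: http://example.com/post.html   3:  title=   4: Example */
   w2 = scan(line, 2, '"');
   put w2=;   /* w2=http://example.com/post.html */
run;
```

On the real page there is more markup (and more quotes) before the link, so the URL lands at a higher word number than 2.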
Super Contributor
Posts: 441

Re: webscraping links from a web page

Posted in reply to Damien_Mather

The links start and finish with quotation marks. As for the position, I thought I had to put the number 2 because the HTML is of the form <a ..., so I thought it starts at position 2. But then I counted the position of the "h" in "http..." and it turned out to be 8, and when I put 8 instead of 2 I did actually get the links!

 

But besides the links, I get lines that don't contain ".html" at all. Why is that?

 

 

Thanks!

Super User
Posts: 10,035

Re: webscraping links from a web page

How about this one? It seems that each page contains 10 posts on Rick's blog website.

 

filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/2";
data blog_links;
   infile rick length=len lrecl=32767;
   input line $varying32767. len;
   p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
   if p then do;
      x=substr(line,p);
      link=scan(x,4,'"');
      title=substr(scan(x,6,'"'),14);
      output;
   end;
   keep link title;
run;
Super Contributor
Posts: 441

Re: webscraping links from a web page

Thanks Ksharp! Your code gives me only the links and the titles, in a neat way!

 

As a logical continuation of the question: is it possible to get all the links from page 1 up to the last page?

Super User
Posts: 10,035

Re: webscraping links from a web page

Make a macro to replace the 1, 2, 3, 4, ... in

 

filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";

 

I think it will be easy for you.

Super User
Posts: 10,035

Re: webscraping links from a web page


%macro rick(n);
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";
data blog_links_&n;
   infile rick length=len lrecl=32767;
   input line $varying32767. len;
   p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
   if p then do;
      x=substr(line,p);
      link=scan(x,4,'"');
      title=substr(scan(x,6,'"'),14);
      output;
   end;
   keep link title;
run;
%mend;

data _null_;
   do n=1 to 4;
      call execute(cats('%rick(',n,')'));
   end;
run;

data rick_blogs;
   set blog_links_:;
run;
Super Contributor
Posts: 441

Re: webscraping links from a web page

Wow, thanks Ksharp for the macro! It's actually an insight to me that macro variables can be used in FILENAME.

 

In your code you run the macro for pages 1 to n. But suppose you want to get all the pages without knowing how many there are: is it possible to run the macro incrementally from page 1 up to the point where a page doesn't exist, and stop there?

Super User
Posts: 10,035

Re: webscraping links from a web page

You can just make it a very big number:

 

do n=1 to 400;
Super User
Posts: 10,035

Re: webscraping links from a web page

This would look better.

 


%macro rick(n);
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";
data blog_links_&n;
   infile rick length=len lrecl=32767;
   input line $varying32767. len;
   p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
   if p then do;
      x=substr(line,p);
      link=scan(x,4,'"');
      title=htmldecode(substr(scan(x,6,'"'),14));
      output;
   end;
   keep link title;
run;
%mend;

data _null_;
   do n=1 to 4;
      call execute(cats('%rick(',n,')'));
   end;
run;

data rick_blogs;
   set blog_links_:;
run;
proc print;run;
Solution
12-09-2016 07:44 PM
Super User
Posts: 10,035

Re: webscraping links from a web page

OK. Here it is:

 


%macro rick;
%do n=78 %to 82;
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";
data blog_links_&n;
   infile rick length=len lrecl=32767;
   input line $varying32767. len;
   p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
   if p then do;
      x=substr(line,p);
      link=scan(x,4,'"');
      title=htmldecode(substr(scan(x,6,'"'),14));
      output;
   end;
   keep link title;
run;

proc sql;
create table _null_ as
   select * from blog_links_&n;
quit;

%if &sqlobs=0 %then %abort;

%end;
%mend;

%rick

data rick_blogs;
   set blog_links_:;
run;
proc print;run;
Super Contributor
Posts: 441

Re: webscraping links from a web page

[ Edited ]

Thanks Ksharp for the solution!

 

I just have one question: why do you use HTMLDECODE to get the title, and why do you use SUBSTR with position 14, if scan(x,6,'"') already should (theoretically) give the title?

 

Thanks!

Super User
Posts: 10,035

Re: webscraping links from a web page

In Rick's blog titles there are HTML entities like &lt; in the source code.
I use htmldecode() to translate &lt; into < .

substr(,,14) is to get the blog title. You want to know why? Check the HTML source code (right click --> View Page Source in Chrome)
and you will see why I coded it this way.

Super Contributor
Posts: 441

Re: webscraping links from a web page

When I right-click and select Inspect Element, I see the HTML code.

 

 

I think I am beginning to understand. I tried to get the title the same way as the link: title = scan(x,6,'"') --> but here I got "Permalink to TITLE" (because this entire string sits between the 5th and 6th delimiter), and the length of the substring "Permalink to " is in fact 13, so by taking the SUBSTR from position 14 you get rid of that prefix.

 

But then I did title = substr(scan(x,6,'"'),14) and got the title, so what exactly does HTMLDECODE do? I searched on Google and the only thing I found about HTMLDECODE is that it decodes HTML text, for example it transforms &lt; to <. So what is HTMLDECODE doing in this case? You said that it translates &lt; into <, but I don't quite understand what that means. Could you please clarify?

 

Thanks!

Super User
Posts: 10,035

Re: webscraping links from a web page

[ Edited ]

Yes, you are right. If you don't use HTMLDECODE(), you are going to see some strange characters like &lt; in the blogs' titles.

Using htmldecode() translates them into normal characters like < .

 Or instead, you could try title = scan(x,9,' " ');

 

For example:

title="Permalink to Ahh, that&#039;s smooth! Anti-aliasing in SAS statistical graphics"

----htmldecode()--->

Ahh, that's smooth! Anti-aliasing in SAS statistical graphics
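(To make that concrete, a minimal DATA step sketch using the title string from the example above; the "Permalink to " prefix is 13 characters long, so the real title starts at position 14:)

```sas
data _null_;
   x = 'Permalink to Ahh, that&#039;s smooth! Anti-aliasing in SAS statistical graphics';
   /* drop the 13-character "Permalink to " prefix, then decode the HTML entity */
   title = htmldecode(substr(x, 14));
   put title=;   /* Ahh, that's smooth! Anti-aliasing in SAS statistical graphics */
run;
```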

 

☑ This topic is solved.
