OK, I see now: if I don't use the HTMLDECODE function, then the apostrophe in "that's" appears as an HTML character entity (e.g., &#39;) instead of a plain apostrophe.
And the reason HTMLDECODE wasn't needed for extracting the link is that symbols such as the apostrophe don't appear in URLs. But to be safe, I guess that whenever I want to extract text, I should always use the HTMLDECODE function.
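For anyone following along, here is a minimal sketch of what HTMLDECODE does; the encoded string is just a made-up example:

```sas
data _null_;
   encoded = 'that&#39;s';       /* apostrophe stored as the numeric entity &#39; */
   decoded = htmldecode(encoded);
   put encoded= decoded=;        /* decoded resolves to: that's */
run;
```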
Thanks Ksharp for your guidance!
Another way is using: title=htmldecode(substr(scan(x,6,'"'),14)); -----> title=scan(scan(x,9,'"'),1,'<>'); Both get the same thing.
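To see that both expressions pull out the same title, here is a small test against a made-up line that mimics the blog's markup (the URL and post title are hypothetical):

```sas
data _null_;
   /* hypothetical line in the same pattern as the blog's archive pages */
   x = '<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/2013/01/01/example.html" title="Permalink to An Example Post" rel="bookmark">An Example Post</a></h2>';
   title1 = htmldecode(substr(scan(x,6,'"'),14)); /* 6th quote-delimited token, minus "Permalink to " */
   title2 = scan(scan(x,9,'"'),1,'<>');           /* 9th quote-delimited token, text between > and < */
   put title1= title2=;                           /* both resolve to: An Example Post */
run;
```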
%macro rick;
  %do n=78 %to 82;
    /* point at page &n of the author's archive */
    filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";

    data blog_links_&n;
      infile rick length=len lrecl=32767;
      input line $varying32767. len;
      /* look for the heading that wraps each post link */
      p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
      if p then do;
        x=substr(line,p);
        link=scan(x,4,'"');                         /* 4th quote-delimited token = URL          */
        title=htmldecode(substr(scan(x,6,'"'),14)); /* drop the leading "Permalink to " text    */
        output;
      end;
      keep link title;
    run;

    /* &SQLOBS counts the rows; 0 rows means we ran past the last page */
    proc sql;
      create table _null_ as
        select * from blog_links_&n;
    quit;
    %if &sqlobs=0 %then %return;
  %end;
%mend;
%rick

data rick_blogs;
  set blog_links_:;  /* stack all of the per-page data sets */
run;

proc print; run;
I love this topic for so many reasons -- it's an innovative use for SAS to pull data from the web, and you all are obviously fans of one of our top blog authors, @Rick_SAS.
Aside from scraping HTML pages, there is another approach you can use: pull the XML-based RSS feed. These XML-based representations are more like regular data, and the XML libname engine can read them as such. Downside: the RSS feeds show only the more recent blogs. You could pull data from this point forward and build a list, but it might be hard to go back into the history. Here's the DO Loop RSS: http://feeds.feedburner.com/TheDoLoop.
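In case anyone wants to try the RSS route, here is a rough sketch using the XMLV2 engine's AUTOMAP= option to generate an XML map automatically. The data set and variable names depend on the feed's structure, so check what the engine actually creates (the ITEM data set name below is an assumption):

```sas
filename rss url "http://feeds.feedburner.com/TheDoLoop";
filename map temp;   /* the engine writes the generated XML map here */

/* libref matches the fileref, so the engine reads from the URL */
libname rss xmlv2 xmlmap=map automap=replace;

/* see which data sets the feed was mapped to (often one per repeating element) */
proc datasets lib=rss; quit;

data blog_posts;
   set rss.item;     /* assumes the feed's entries map to a data set named ITEM */
run;
```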
At SAS, we also slice and dice our blog data, but we have an advantage -- we can point SAS directly at our blog database. I've written a paper about our process here.
If you simply want the data (and aren't as invested in the journey of parsing HTML), there is one more thing you can try: just ask us! Here -- I've attached a CSV file of all of The DO Loop posts to-date, combined with the tag terms. If you want the larger inventory of technical blogs (The SAS Dummy, Graphically Speaking, The Learning Post, SAS Users) -- I can easily add to it. Want this on a regular basis? I can try to make that happen -- I'll have to think about the best way to automate it...
Hi Chris,
It is in fact very interesting to me to know how to parse data from the internet. SAS is great at analyzing data, but you have to have the data in the first place, and the internet seems like an untapped source!
As for the blogs, it would be great if there were some sort of table of contents of the authors' posts, so that newcomers can see what was posted before. I personally have found (and keep finding) a lot of valuable information!
Maybe you could even post Ksharp's code on the blog in order to let readers learn about webscraping and exploring the blog contents?
Thanks!
Jiangtang Hu and Charlie Huang each did something like this back in 2011. Jiangtang's blog seems to be gone, but here is a link to Charlie Huang's macro and his subsequent statistical analysis of topics: http://blog.sasanalysis.com/2011/10/rick-wicklins-195th-blog-posts.html