OK, I see now: if I don't use the HTMLDECODE function, then the apostrophe in "that's" appears as an HTML character entity (e.g., &#39;) instead of a plain apostrophe.
And the reason HTMLDECODE wasn't needed for extracting the link is that symbols such as the apostrophe don't appear in URLs. But to be safe, I guess that whenever I want to extract text, I should always use the HTMLDECODE function.
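For anyone following along, here is a minimal sketch of what HTMLDECODE does; the encoded string is just a made-up example:

```sas
data _null_;
   encoded = 'that&#39;s';       /* apostrophe stored as the numeric entity &#39; */
   decoded = htmldecode(encoded);
   put encoded= decoded=;        /* decoded resolves to: that's */
run;
```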
Thanks Ksharp for your guidance!
Another way is using: title=htmldecode(substr(scan(x,6,'"'),14)); -----> title=scan(scan(x,9,'"'),1,'<>'); Both get the same thing.
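To see that both expressions pull out the same title, here is a small test against a made-up line that mimics the blog's markup (the URL and post title are hypothetical):

```sas
data _null_;
   /* hypothetical line in the same pattern as the blog's archive pages */
   x = '<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/2013/01/01/example.html" title="Permalink to An Example Post" rel="bookmark">An Example Post</a></h2>';
   title1 = htmldecode(substr(scan(x,6,'"'),14)); /* 6th quote-delimited token, minus "Permalink to " */
   title2 = scan(scan(x,9,'"'),1,'<>');           /* 9th quote-delimited token, text between > and < */
   put title1= title2=;                           /* both resolve to: An Example Post */
run;
```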
%macro rick;
  %do n=78 %to 82;
    /* point at page &n of the author's archive */
    filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";

    data blog_links_&n;
      infile rick length=len lrecl=32767;
      input line $varying32767. len;
      /* look for the heading that wraps each post link */
      p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
      if p then do;
        x=substr(line,p);
        link=scan(x,4,'"');                         /* 4th quote-delimited token = URL          */
        title=htmldecode(substr(scan(x,6,'"'),14)); /* drop the leading "Permalink to " text    */
        output;
      end;
      keep link title;
    run;

    /* &SQLOBS counts the rows; 0 rows means we ran past the last page */
    proc sql;
      create table _null_ as
        select * from blog_links_&n;
    quit;
    %if &sqlobs=0 %then %return;
  %end;
%mend;
%rick

data rick_blogs;
  set blog_links_:;  /* stack all of the per-page data sets */
run;

proc print; run;
I love this topic for so many reasons -- it's an innovative use for SAS to pull data from the web, and you all are obviously fans of one of our top blog authors, @Rick_SAS.
Aside from scraping HTML pages, there is another approach you can use: pull the XML-based RSS feed. These XML-based representations are more like regular data, and the XML libname engine can read them as such. Downside: the RSS feeds show only the more recent blogs. You could pull data from this point forward and build a list, but it might be hard to go back into the history. Here's the DO Loop RSS: http://feeds.feedburner.com/TheDoLoop.
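In case anyone wants to try the RSS route, here is a rough sketch using the XMLV2 engine's AUTOMAP= option to generate an XML map automatically. The data set and variable names depend on the feed's structure, so check what the engine actually creates (the ITEM data set name below is an assumption):

```sas
filename rss url "http://feeds.feedburner.com/TheDoLoop";
filename map temp;   /* the engine writes the generated XML map here */

/* libref matches the fileref, so the engine reads from the URL */
libname rss xmlv2 xmlmap=map automap=replace;

/* see which data sets the feed was mapped to (often one per repeating element) */
proc datasets lib=rss; quit;

data blog_posts;
   set rss.item;     /* assumes the feed's entries map to a data set named ITEM */
run;
```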
At SAS, we also slice and dice our blog data, but we have an advantage -- we can point SAS directly at our blog database. I've written a paper about our process here.
If you simply want the data (and aren't as invested in the journey of parsing HTML), there is one more thing you can try: just ask us! Here -- I've attached a CSV file of all of The DO Loop posts to-date, combined with the tag terms. If you want the larger inventory of technical blogs (The SAS Dummy, Graphically Speaking, The Learning Post, SAS Users) -- I can easily add to it. Want this on a regular basis? I can try to make that happen -- I'll have to think about the best way to automate it...
Hi Chris,
It is in fact very interesting to me to know how to parse data from the internet. SAS is great at analyzing data, but you have to have the data in the first place, and the internet seems like an untapped source!
As for the blogs, it would be great if there were some sort of table of contents of the authors' posts, so that newcomers can see what was posted before. I personally have found (and keep finding) a lot of valuable information!
Maybe you could even post Ksharp's code on the blog in order to let readers learn about webscraping and exploring the blog contents?
Thanks!
Jiangtang Hu and Charlie Huang each did something like this back in 2011. Jiangtang's blog seems to be gone, but here is a link to Charlie Huang's macro and his subsequent statistical analysis of topics: http://blog.sasanalysis.com/2011/10/rick-wicklins-195th-blog-posts.html