ilikesas
Barite | Level 11

OK, I see now: if I don't use the htmldecode function, then the ' in "that's" will appear as its HTML character code rather than as a plain apostrophe.

 

The reason htmldecode wasn't used for getting the link is that symbols such as ' don't appear in the link. But to be safe, I guess that whenever I want to extract text I should always use the htmldecode function.
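To make the point above concrete, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the blog pages; they only show the difference between keeping the raw text and passing it through htmldecode.

```sas
data _null_;
  length raw decoded $60;
  /* Hypothetical encoded titles, not copied from the real page */
  do raw = 'That&#39;s a title', 'Rows &amp; columns';
    decoded = htmldecode(raw);   /* turn character references back into characters */
    put raw= decoded=;           /* write both versions to the log for comparison */
  end;
run;
```

The strings are in single quotes so that the macro processor leaves the ampersands alone.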

 

Thanks Ksharp for your guidance!

Ksharp
Super User
Another way is to replace:

 title=htmldecode(substr(scan(x,6,'"'),14)); 

with:

 title=scan(scan(x,9,'"'),1,'<>'); 


Both should give the same result.
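As a rough check of the two expressions, the sketch below runs both against one sample line. The line is an assumption: it is a made-up example shaped the way the code above implies (a title attribute with a 13-character "Permalink to " prefix, followed by the link text), and the real page markup may differ.

```sas
data _null_;
  length x $300 link title_a title_b $100;
  /* Hypothetical heading line; URL slug and title are invented */
  x = '<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/example-post" title="Permalink to Sample post title" rel="bookmark">Sample post title</a></h2>';
  link    = scan(x, 4, '"');                            /* 4th quote-delimited token: the href value */
  title_a = htmldecode(substr(scan(x, 6, '"'), 14));    /* 6th token, skipping its first 13 characters */
  title_b = scan(scan(x, 9, '"'), 1, '<>');             /* 9th token, then the text before the next tag */
  put link= / title_a= / title_b=;
run;
```

If the link text can also contain encoded characters, the second expression can be wrapped in htmldecode as well.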


Ksharp
Super User

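/* Loop over archive pages 78-82 and extract the link and title of each post */
%macro rick;
%do n=78 %to 82;
  filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&n";

  data blog_links_&n;
    infile rick length=len lrecl=32767;
    input line $varying32767. len;
    /* Locate the heading tag that carries a post link */
    p=find(line,'<h2 class="entry-title"><a href="http://blogs.sas.com/content/iml/');
    if p then do;
      x=substr(line,p);
      link=scan(x,4,'"');                          /* 4th quote-delimited token: the href value */
      title=htmldecode(substr(scan(x,6,'"'),14));  /* 6th token from position 14 on, HTML-decoded */
      output;
    end;
    keep link title;
  run;

  /* Query the new table only so that &sqlobs reports its row count */
  proc sql;
    create table _null_ as
      select * from blog_links_&n;
  quit;

  /* Stop as soon as a page yields no posts */
  %if &sqlobs=0 %then %return;
%end;
%mend;

%rick

/* Stack the per-page tables into one */
data rick_blogs;
  set blog_links_:;
run;
proc print;run;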
ChrisHemedinger
Community Manager

I love this topic for so many reasons -- it's an innovative use for SAS to pull data from the web, and you all are obviously fans of one of our top blog authors, @Rick_SAS.

 

Aside from scraping HTML pages, there is another approach you can use: pull the XML-based RSS feed.  These XML-based representations are more like regular data, and the XML libname engine can read them as such.  Downside: the RSS feeds show only the more recent blog posts.  You could pull data from this point forward and build a list, but it might be hard to go back into the history.  Here's the DO Loop RSS: http://feeds.feedburner.com/TheDoLoop.
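A minimal sketch of the RSS approach described above might look like the following. It assumes the feed URL still responds and that the XMLV2 engine's AUTOMAP option can build a usable map for it; the fileref and libref names (feed, feedmap, rss) are made up for illustration, and the tables you get depend on how the feed is structured.

```sas
/* Sketch only: fileref and libref names are illustrative */
filename feed temp;       /* local copy of the RSS XML */
filename feedmap temp;    /* XMLMap that AUTOMAP generates */

/* Download the feed to the temporary file */
proc http
  url="http://feeds.feedburner.com/TheDoLoop"
  method="GET"
  out=feed;
run;

/* Point the XML engine at the downloaded file and let it build a map */
libname rss xmlv2 xmlfileref=feed xmlmap=feedmap automap=replace;

/* List whatever tables the generated map exposes */
proc datasets lib=rss;
quit;

libname rss clear;
```

Downloading first and then reading the local copy keeps the web request separate from the XML parsing, which makes it easier to see which step failed if something goes wrong.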

 

At SAS, we also slice and dice our blog data, but we have an advantage -- we can point SAS directly at our blog database. I've written a paper about our process here.

 

If you simply want the data (and aren't as invested in the journey of parsing HTML), there is one more thing you can try: just ask us!  Here -- I've attached a CSV file of all of The DO Loop posts to date combined with the tag terms.  If you want the larger inventory of technical blogs (The SAS Dummy, Graphically Speaking, The Learning Post, SAS Users) -- I can easily add to it.  Want this on a regular basis? I can try to make that happen -- I'll have to think about the best way to automate it...

ilikesas
Barite | Level 11

Hi Chris,

 

It is in fact very interesting to me to know how to parse data from the internet. SAS is great at analyzing data, but you have to have the data in the first place, and the internet seems like an untapped source! 

 

As for the blogs, it would be great if there were some sort of table of contents for each author's blog, so that newcomers can see what was posted before. I personally have found (and still keep searching for) a lot of valuable information!

Maybe you could even post Ksharp's code on the blog in order to let readers learn about webscraping and exploring the blog contents?

 

Thanks!

 

 

Rick_SAS
SAS Super FREQ

Jiangtang Hu and Charlie Huang each did something like this back in 2011. Jiangtang's blog seems to be gone, but here is a link to Charlie Huang's macro and his subsequent statistical analysis of topics: http://blog.sasanalysis.com/2011/10/rick-wicklins-195th-blog-posts.html

 
