<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: webscraping links from a web page in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317792#M69602</link>
    <description>&lt;P&gt;Wow, thanks Ksharp for the macro! It's actually news to me that macro variables can be used in the "filename" statement.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your code runs the macro for pages 1 to n. But suppose you want all the pages and don't know how many there are. Is it possible to run the macro incrementally from page 1 and stop as soon as a page doesn't exist?&lt;/P&gt;</description>
    <pubDate>Fri, 09 Dec 2016 04:19:29 GMT</pubDate>
    <dc:creator>ilikesas</dc:creator>
    <dc:date>2016-12-09T04:19:29Z</dc:date>
    <item>
      <title>webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317764#M69589</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am reading Rick Wicklin's SAS blog, and I would like to scrape the full list of his posts. There are many pages (currently 79), and it is tedious to go through each one to see what has been posted.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The approach that I undertook is to go to a certain page and scrape the lines that end with .html (this is how the links end) and which are confined between quotes. Here is the code I tried:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/2";
data blog_links(keep=link);
length link $200;
infile rick length=len lrecl=32767;
input line $varying32767. len;
if find(line,".html") then do;
link=scan(line,2,'"');
output;
end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;But I didn't get any links at all. Could you please help me?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 02:01:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317764#M69589</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-09T02:01:47Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317766#M69590</link>
      <description>not sure I follow the logic behind &lt;BR /&gt;&lt;BR /&gt;link=scan(line,2,'"');&lt;BR /&gt;&lt;BR /&gt;do any or all links on those pages begin at position 2 and finish with double quotes?</description>
      <pubDate>Fri, 09 Dec 2016 02:24:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317766#M69590</guid>
      <dc:creator>Damien_Mather</dc:creator>
      <dc:date>2016-12-09T02:24:42Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317773#M69593</link>
      <description>&lt;P&gt;The links start and finish with quotation marks. As for the position, I thought that I had to use the number 2 because the HTML is of the form &amp;lt;a ... and so I assumed the link starts at position 2. But then I counted the position of the "h" in "http..." and it turned out to be 8, and when I put 8 instead of 2 I actually did get the links!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But besides the links, I also get lines that don't contain "html" at all. Why is that?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 02:57:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317773#M69593</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-09T02:57:17Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317778#M69595</link>
      <description>&lt;P&gt;How about this one? It seems that each page of Rick's blog contains 10 posts.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/2";
data blog_links;
infile rick length=len lrecl=32767;
input line $varying32767. len;
p=find(line,'&amp;lt;h2 class="entry-title"&amp;gt;&amp;lt;a href="http://blogs.sas.com/content/iml/');
if p then do;
 x=substr(line,p);
 link=scan(x,4,'"');
 title=substr(scan(x,6,'"'),14); 
output;
end;
keep link title;
run;
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 09 Dec 2016 03:09:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317778#M69595</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T03:09:30Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317780#M69596</link>
      <description>&lt;P&gt;Thanks Ksharp! &amp;nbsp;Your code gives me only the links and the titles in a neat way!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As a logical continuation of the question, is it possible to get all the links from page 1 up to the last page?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 03:16:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317780#M69596</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-09T03:16:19Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317789#M69600</link>
      <description>&lt;P&gt;Make a macro to replace the page number 1, 2, 3, 4, ...&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&amp;amp;n";&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think this will be easy for you.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 03:51:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317789#M69600</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T03:51:36Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317791#M69601</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
%macro rick(n);
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&amp;amp;n";
data blog_links_&amp;amp;n;
infile rick length=len lrecl=32767;
input line $varying32767. len;
p=find(line,'&amp;lt;h2 class="entry-title"&amp;gt;&amp;lt;a href="http://blogs.sas.com/content/iml/');
if p then do;
 x=substr(line,p);
 link=scan(x,4,'"');
 title=substr(scan(x,6,'"'),14); 
output;
end;
keep link title;
run;
%mend;

data _null_;
 do n=1 to 4;
  call execute(cats('%rick(',n,')'));
 end;
run;

data rick_blogs;
 set blog_links_:;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 09 Dec 2016 04:11:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317791#M69601</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T04:11:38Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317792#M69602</link>
      <description>&lt;P&gt;Wow, thanks Ksharp for the macro! It's actually news to me that macro variables can be used in the "filename" statement.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your code runs the macro for pages 1 to n. But suppose you want all the pages and don't know how many there are. Is it possible to run the macro incrementally from page 1 and stop as soon as a page doesn't exist?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 04:19:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317792#M69602</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-09T04:19:29Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317795#M69605</link>
      <description>&lt;P&gt;You can set the upper bound to a very large number.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;do n=1 to 400;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 09 Dec 2016 04:45:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317795#M69605</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T04:45:09Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317798#M69606</link>
      <description>&lt;P&gt;This would look better.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
%macro rick(n);
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&amp;amp;n";
data blog_links_&amp;amp;n;
infile rick length=len lrecl=32767;
input line $varying32767. len;
p=find(line,'&amp;lt;h2 class="entry-title"&amp;gt;&amp;lt;a href="http://blogs.sas.com/content/iml/');
if p then do;
 x=substr(line,p);
 link=scan(x,4,'"');
 title=htmldecode(substr(scan(x,6,'"'),14)); 
output;
end;
keep link title;
run;
%mend;

data _null_;
 do n=1 to 4;
  call execute(cats('%rick(',n,')'));
 end;
run;

data rick_blogs;
 set blog_links_:;
run;
proc print;run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 09 Dec 2016 04:51:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317798#M69606</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T04:51:38Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317799#M69607</link>
      <description>&lt;P&gt;OK. Here it is:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
%macro rick;
%do n=78 %to 82;
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&amp;amp;n";
data blog_links_&amp;amp;n;
infile rick length=len lrecl=32767;
input line $varying32767. len;
p=find(line,'&amp;lt;h2 class="entry-title"&amp;gt;&amp;lt;a href="http://blogs.sas.com/content/iml/');
if p then do;
 x=substr(line,p);
 link=scan(x,4,'"');
 title=htmldecode(substr(scan(x,6,'"'),14)); 
output;
end;
keep link title;
run;

proc sql;
create table _null_ as
 select * from  blog_links_&amp;amp;n;
quit;

%if &amp;amp;sqlobs=0 %then %ABORT;

%end;

%mend;

%rick

data rick_blogs;
 set blog_links_:;
run;
proc print;run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 09 Dec 2016 05:13:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317799#M69607</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T05:13:54Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317801#M69608</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
%macro rick;
%do n=78 %to 82;
filename rick url "http://blogs.sas.com/content/iml/author/rickwicklin/page/&amp;amp;n";
data blog_links_&amp;amp;n;
infile rick length=len lrecl=32767;
input line $varying32767. len;
p=find(line,'&amp;lt;h2 class="entry-title"&amp;gt;&amp;lt;a href="http://blogs.sas.com/content/iml/');
if p then do;
 x=substr(line,p);
 link=scan(x,4,'"');
 title=htmldecode(substr(scan(x,6,'"'),14)); 
output;
end;
keep link title;
run;

proc sql;
create table _null_ as
 select * from  blog_links_&amp;amp;n;
quit;

%if &amp;amp;sqlobs=0 %then %return ;

%end;

%mend;

%rick

data rick_blogs;
 set blog_links_:;
run;
proc print;run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 09 Dec 2016 05:29:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/317801#M69608</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-09T05:29:08Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318059#M69653</link>
      <description>&lt;P&gt;Thanks Ksharp for the solution!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I just have a question: why do you use "htmldecode" to get the title, and why do you use substr with position 14 if scan(x,6,'"') should (theoretically) already give the title?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Sat, 10 Dec 2016 04:59:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318059#M69653</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-10T04:59:48Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318068#M69656</link>
      <description>&lt;PRE&gt;
Rick's blog titles contain HTML character entities such as &amp;amp;lt; in the page source.
I use htmldecode() to translate &amp;amp;lt; into &amp;lt; .

substr(...,14) extracts the blog title. Want to know why? Check the HTML source code (right click --&amp;gt; View Page Source in Chrome)
and you will see why I coded it this way.

&lt;/PRE&gt;</description>
      <pubDate>Sat, 10 Dec 2016 10:43:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318068#M69656</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-10T10:43:01Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318076#M69659</link>
      <description>&lt;P&gt;When I right click and select Inspect element I see the html code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think I am beginning to understand. I tried to get the title in the same way as the link, title = scan(x,6,'"'), but then I got "Permalink to TITLE" (because this entire string sits between the 5th and 6th delimiters). The length of the substring "Permalink to " is in fact 13, so taking the substr from position 14 gets rid of that prefix.&lt;/P&gt;
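&lt;P&gt;To convince myself, I tried the same functions on a made-up line. Only the overall shape follows what I see in the page source; the URL, the rel attribute and the variable raw_title are my own assumptions for this test:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  length x $300 link raw_title title $200;

  /* Hypothetical line shaped like the blog markup (not copied from the site) */
  x = '&amp;lt;h2 class="entry-title"&amp;gt;&amp;lt;a href="http://blogs.sas.com/content/iml/example-post.html"'
      || ' title="Permalink to Example post" rel="bookmark"&amp;gt;Example post&amp;lt;/a&amp;gt;&amp;lt;/h2&amp;gt;';

  /* SCAN counts words separated by the delimiter, not character positions */
  link      = scan(x, 4, '"');        /* 4th quote-delimited word: the URL        */
  raw_title = scan(x, 6, '"');        /* 6th word: Permalink to Example post      */
  title     = substr(raw_title, 14);  /* skip the 13 characters of "Permalink to " */

  put link= / raw_title= / title=;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;In the log, title comes out as "Example post", which matches the counting above.&lt;/P&gt;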
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But then I did title = substr(scan(x,6,'"'),14) and got the title, so what exactly does htmldecode do? I searched on Google and the only thing I found is that it decodes HTML text, for example it transforms &amp;amp;lt; to &amp;lt;. So what is htmldecode doing in this case? You said that it translates &amp;amp;lt; into &amp;lt;, but I don't quite understand what that means here. Could you please clarify?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Sat, 10 Dec 2016 17:50:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318076#M69659</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-10T17:50:51Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318105#M69672</link>
      <description>&lt;P&gt;Yes, you are right. If you don't use HTMLDECODE(), you will see strange character sequences like &amp;amp;lt; in the blog titles.&lt;/P&gt;
&lt;P&gt;Using htmldecode() translates them into normal characters like &amp;lt;.&lt;/P&gt;
&lt;P&gt;Or instead, you could try&lt;FONT color="#FF0000"&gt;&lt;STRONG&gt;&amp;nbsp; title = scan(x,9,'"');&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For example:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;title="Permalink to Ahh, that&lt;FONT color="#FF0000"&gt;&lt;STRONG&gt;&amp;amp;#039;&lt;/STRONG&gt;&lt;/FONT&gt;s smooth! Anti-aliasing in SAS statistical graphics"

----htmldecode()---&amp;gt;

Ahh, that's smooth! Anti-aliasing in SAS statistical graphics

&lt;/CODE&gt;&lt;/PRE&gt;
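&lt;P&gt;If you want to check this yourself, here is a small sketch. The variable names raw and decoded are only for illustration, and the string is the title from the example above without the "Permalink to " prefix:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  length raw decoded $200;

  /* Single quotes keep the macro processor away from the ampersand */
  raw = 'Ahh, that&amp;amp;#039;s smooth! Anti-aliasing in SAS statistical graphics';

  /* HTMLDECODE turns the numeric character reference back into an apostrophe */
  decoded = htmldecode(raw);

  put raw= / decoded=;
run;&lt;/CODE&gt;&lt;/PRE&gt;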
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 11 Dec 2016 04:13:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318105#M69672</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-11T04:13:12Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318111#M69676</link>
      <description>&lt;P&gt;OK, I see now: if I don't use the htmldecode function, then the ' in "that's" will appear as &amp;amp;#039;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And the reason why htmldecode wasn't used for getting the link is that symbols such as ' don't appear in the link. But to be safe, I guess that if I want to extract text then I should always use the htmldecode function.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks Ksharp for your guidance!&lt;/P&gt;</description>
      <pubDate>Sun, 11 Dec 2016 04:47:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318111#M69676</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-11T04:47:39Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318123#M69680</link>
      <description>&lt;PRE&gt;
Another way is to change:

 title=htmldecode(substr(scan(x,6,'"'),14)); 

-----&amp;gt;

 title=scan(scan(x,9,'"'),1,'&amp;lt;&amp;gt;'); 


Both should give the same result.


&lt;/PRE&gt;</description>
      <pubDate>Sun, 11 Dec 2016 11:46:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318123#M69680</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2016-12-11T11:46:49Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318136#M69686</link>
      <description>&lt;P&gt;I love this topic for so many reasons -- it's an innovative use for SAS to pull data from the web, and you all are obviously fans of one of our top blog authors, &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13684"&gt;@Rick_SAS&lt;/a&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Aside from scraping HTML pages, there is another approach you can use: pull the XML-based RSS feed. &amp;nbsp;These XML-based representations are more like regular data, and the XML libname engine can read them as such.&amp;nbsp; Downside: the RSS feeds show only the more recent blogs. &amp;nbsp;You could pull data from this point forward and build a list, but it might be hard to go back into the history. &amp;nbsp;Here's The DO Loop RSS feed: &lt;A href="http://feeds.feedburner.com/TheDoLoop" target="_blank"&gt;http://feeds.feedburner.com/TheDoLoop&lt;/A&gt;.&lt;/P&gt;
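&lt;P&gt;As a rough, untested sketch of the RSS approach, something like the following downloads the feed and lets the engine generate its own map. It assumes SAS 9.4 with PROC HTTP and the XMLV2 engine; the fileref names feed and map are placeholders, and the table names you get back depend on the feed's structure:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;filename feed temp;   /* local copy of the RSS document       */
filename map  temp;   /* XMLMap file that AUTOMAP will write  */

/* Download the feed to the temporary file */
proc http
  url="http://feeds.feedburner.com/TheDoLoop"
  method="GET"
  out=feed;
run;

/* Let the XMLV2 engine build a map from the document and read it as tables */
libname feed xmlv2 xmlfileref=feed xmlmap=map automap=replace;

/* List the generated tables before deciding which ones to keep */
proc contents data=feed._all_ nods;
run;&lt;/CODE&gt;&lt;/PRE&gt;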
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;At SAS, we also slice and dice our blog data, but we have an advantage -- we can point SAS directly at our blog database. &lt;A href="https://support.sas.com/resources/papers/proceedings15/SAS1708-2015.pdf" target="_self"&gt;I've written a paper about our process here&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you simply want the data (and aren't as invested in the journey of parsing HTML), there is one more&amp;nbsp;thing you can try: &lt;STRONG&gt;just ask us&lt;/STRONG&gt;! &amp;nbsp;Here -- I've attached a CSV file of&amp;nbsp;&lt;STRONG&gt;all&lt;/STRONG&gt; of The DO Loop posts to date, combined with the tag terms. &amp;nbsp;If you want the larger inventory of technical blogs (The SAS Dummy, Graphically Speaking, The Learning Post, SAS Users) -- I can easily add to it. &amp;nbsp;Want this on a regular basis? I can try to make that happen -- I'll have to think about the best way to automate it...&lt;/P&gt;</description>
      <pubDate>Sun, 11 Dec 2016 16:45:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318136#M69686</guid>
      <dc:creator>ChrisHemedinger</dc:creator>
      <dc:date>2016-12-11T16:45:20Z</dc:date>
    </item>
    <item>
      <title>Re: webscraping links from a web page</title>
      <link>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318172#M69693</link>
      <description>&lt;P&gt;Hi Chris,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It is in fact very interesting to me to know how to parse data from the internet. SAS is great at analyzing data, but you have to have the data in the first place, and the internet seems like an untapped source!&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As for the blogs, it would be great if there were some sort of table of contents for each author's blog, so that newcomers can see what was posted before. I personally have found (and still keep searching for) a lot of valuable information!&lt;/P&gt;
&lt;P&gt;Maybe you could even post Ksharp's code on the blog in order to let readers learn about webscraping and exploring the blog contents?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 11 Dec 2016 22:02:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/webscraping-links-from-a-web-page/m-p/318172#M69693</guid>
      <dc:creator>ilikesas</dc:creator>
      <dc:date>2016-12-11T22:02:53Z</dc:date>
    </item>
  </channel>
</rss>

