kaziumair
Quartz | Level 8

Hi everyone, I am trying to scrape a website that has a "load more" button. Initially the site shows around 20 articles; to see more, the user has to press the "load more" button or scroll down.

Is there any way we can bypass this and scrape the website?

This is the website I am trying to scrape:

https://www.dailymaverick.co.za/section/world/ 

 

I just want to extract the main headlines from the articles.

Thanks.

Accepted Solution
ChrisHemedinger
Community Manager

For this site, it looks like you might be able to get an RSS feed (XML):

 

https://www.dailymaverick.co.za/dmrss/

 

I've shared how to fetch/parse RSS feeds in this article.

 

For fun, I applied the technique to this source:

 

/* Copyright SAS Institute Inc. */

filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="item">
        <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
        <COLUMN name="title">
            <PATH syntax="XPath">/rss/channel/item/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>250</LENGTH>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/rss/channel/item/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>200</LENGTH>
        </COLUMN>
        <COLUMN name="pubDate">
            <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;
run;

/* WordPress feeds return data in pages, 25 entries at a time        */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
  filename feed temp;
  proc http
   method="get"
   url="https://www.dailymaverick.co.za/dmrss?paged=&i."
   out=feed;
  run;
 
  libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
 
  data posts_&i.;
   set result.item;
  run;
  %end;
%mend;
 
%getItems;
 
/* Assemble all pages of entries                               */
/* Convert the pubDate text into a SAS datetime value          */
/* SUBSTR skips the leading day-name abbreviation              */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> displayed as 10APR2019 */
data allPosts ;
 set posts_:;
 length sasPubdate 8;
 sasPubdate = input( substr(pubDate,4),anydtdtm.);
 format sasPubdate dtdate9.;
 drop pubDate;
run;

Result: [screenshot: rssfeed.jpg]
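The paging macro above always requests five pages and parses whatever comes back. As a rough sketch of one way to make it more defensive, the variant below checks the HTTP status after each request and stops at the first page that does not return 200. It assumes SAS 9.4M5 or later (where PROC HTTP sets the SYS_PROCHTTP_STATUS_CODE macro variable), that the rssmap fileref from the code above is still assigned, and that the feed answers with a non-200 status once the pages run out, which has not been verified for this site. The macro name getItemsChecked and its pages= parameter are made up for this illustration.

```sas
/* Hypothetical variant of getItems: stop when a page request fails */
%macro getItemsChecked(pages=5);
  %do i = 1 %to &pages.;
    filename feed temp;
    proc http
      method="GET"
      url="https://www.dailymaverick.co.za/dmrss?paged=&i."
      out=feed;
    run;

    /* SYS_PROCHTTP_STATUS_CODE is set by PROC HTTP (SAS 9.4M5+) */
    %if &SYS_PROCHTTP_STATUS_CODE. ne 200 %then %do;
      %put NOTE: Page &i. returned HTTP status &SYS_PROCHTTP_STATUS_CODE., stopping.;
      %return;
    %end;

    libname result XMLv2 xmlfileref=feed xmlmap=rssmap;

    data posts_&i.;
      set result.item;
    run;

    /* Release the library before the fileref is reassigned */
    libname result clear;
  %end;
%mend getItemsChecked;

%getItemsChecked(pages=5);
```

The final DATA step that reads posts_: works unchanged, since it picks up however many posts_ data sets were created. Data sets left over from an earlier run would be picked up too, so it may be worth deleting them first.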

 

kaziumair
Quartz | Level 8
Hi, thank you for your guidance. Is there a way to fetch the RSS feed for only one particular section of the website? This RSS feed seems to return items from all sections of the site.
ChrisHemedinger
Community Manager

This appears to be a WordPress site, so per-category feeds might be available, but finding them will take some exploration. So far I've seen only the one main feed.

 

The RSS feed does have a category field. Example:

 

<category>South Africa</category>

You could use this in your SAS process to filter items from the data after you fetch it.

 

Revised code to capture the category:

 

filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="item">
        <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
        <COLUMN name="title">
            <PATH syntax="XPath">/rss/channel/item/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>250</LENGTH>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/rss/channel/item/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>200</LENGTH>
        </COLUMN>
        <COLUMN name="pubDate">
            <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
         <COLUMN name="category">
            <PATH syntax="XPath">/rss/channel/item/category</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;
run;

/* WordPress feeds return data in pages, 25 entries at a time        */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
  filename feed temp;
  proc http
   method="get"
   url="https://www.dailymaverick.co.za/dmrss?paged=&i."
   out=feed;
  run;
 
  libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
 
  data posts_&i.;
   set result.item;
  run;
  %end;
%mend;
 
%getItems;
 
/* Assemble all pages of entries                               */
/* Convert the pubDate text into a SAS datetime value          */
/* SUBSTR skips the leading day-name abbreviation              */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> displayed as 10APR2019 */
data allPosts ;
 set posts_:;
 length sasPubdate 8;
 sasPubdate = input( substr(pubDate,4),anydtdtm.);
 format sasPubdate dtdate9.;
 drop pubDate;
run;

[screenshot: rssfeed.jpg]
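With the category column captured in allPosts, the filtering step mentioned above could look something like the sketch below. The value "World" is an assumption based on the section in the original question; the PROC FREQ step lists the category labels that actually appear in the fetched data so the filter can be adjusted. The data set name worldPosts is made up for this example. An RSS item can carry more than one category element, and this map may expose only one of them per item, so a filter like this could miss some posts.

```sas
/* List the category values present in the fetched items */
proc freq data=allPosts;
  tables category / nocum nopercent;
run;

/* Keep only one section; "World" is an assumed label,   */
/* so replace it with a value shown by PROC FREQ above   */
data worldPosts;
  set allPosts;
  where upcase(category) = "WORLD";
run;

/* Show the headlines for that section */
proc print data=worldPosts noobs;
  var title sasPubdate;
run;
```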

kaziumair
Quartz | Level 8
Thanks a lot for your help and guidance.
kaziumair
Quartz | Level 8

Hi, how can I scrape a website that has a "load more" button and no RSS feed? Is there a way to call the JavaScript that executes the "load more" functionality from SAS?
