kaziumair
Quartz | Level 8

Hi everyone, I am trying to scrape a website with a "load more" button. Initially the website shows around 20 articles; to see more, the user has to press the load more button or scroll down.

Is there any way we can bypass this and scrape the website?

This is the website I am trying to scrape:

https://www.dailymaverick.co.za/section/world/ 

 

I just want to extract the main headlines from the articles.

Thanks.

1 ACCEPTED SOLUTION

ChrisHemedinger
Community Manager

For this site, it looks like you might be able to get an RSS feed (XML):

 

https://www.dailymaverick.co.za/dmrss/

 

I've shared how to fetch/parse RSS feeds in this article.

 

For fun, I applied the technique to this source:

 

/* Copyright SAS Institute Inc. */

filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="item">
        <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
        <COLUMN name="title">
            <PATH syntax="XPath">/rss/channel/item/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>250</LENGTH>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/rss/channel/item/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>200</LENGTH>
        </COLUMN>
        <COLUMN name="pubDate">
            <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;
run;

/* WordPress feeds return data in pages, 25 entries at a time        */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
  filename feed temp;
  proc http
   method="get"
   url="https://www.dailymaverick.co.za/dmrss?paged=&i."
   out=feed;
  run;
 
  libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
 
  data posts_&i.;
   set result.item;
  run;
  %end;
%mend;
 
%getItems;
 
/* Assemble all pages of entries                       */
/* Convert the date field into a SAS datetime value    */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019      */
data allPosts ;
 set posts_:;
 length sasPubdate 8;
 sasPubdate = input( substr(pubDate,4),anydtdtm.);
 format sasPubdate dtdate9.;
 drop pubDate;
run;

Result:

rssfeed.jpg

 


5 REPLIES 5
kaziumair
Quartz | Level 8
Hi, thank you for your guidance. Is there a way to fetch the RSS feed for only a particular section of the website? This RSS feed seems to return data from all sections of the website.
ChrisHemedinger
Community Manager

This appears to be a WordPress site, so the options for different categories might be there. However, it will require some exploration. So far I've seen only the one main feed.
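One way to do that exploration from SAS is to request a candidate feed URL and look at the HTTP status and the first few lines of the response. This is only a sketch: the URL below is a guess based on the standard WordPress pattern /category/<slug>/feed/, and I have not confirmed that this site exposes it. It also assumes SAS 9.4M5 or later, where PROC HTTP sets the SYS_PROCHTTP_STATUS_CODE macro variable.

```sas
/* Hypothetical candidate URL: standard WordPress category feeds follow  */
/* the pattern /category/<slug>/feed/, but this site may not expose one. */
%let candidate = https://www.dailymaverick.co.za/category/world/feed/;

filename probe temp;

proc http
  method="get"
  url="&candidate."
  out=probe;
run;

/* SYS_PROCHTTP_STATUS_CODE is set by PROC HTTP (SAS 9.4M5 and later) */
%put NOTE: &candidate. returned HTTP status &SYS_PROCHTTP_STATUS_CODE.;

/* Print the first few records to the log to see whether it is RSS XML */
data _null_;
  infile probe obs=5 lrecl=32767 truncover;
  input;
  put _infile_;
run;

filename probe clear;
```

A 200 status with a response that starts with an <rss> element would suggest a usable section feed; a 404 or an HTML page means that guess was wrong and another URL needs to be tried.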

 

The RSS feed does have a category field. Example:

 

<category>South Africa</category>

You could use this in your SAS process to filter items from the data after you fetch it.

 

Revised code to capture the category:

 

filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="item">
        <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
        <COLUMN name="title">
            <PATH syntax="XPath">/rss/channel/item/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>250</LENGTH>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/rss/channel/item/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>200</LENGTH>
        </COLUMN>
        <COLUMN name="pubDate">
            <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
         <COLUMN name="category">
            <PATH syntax="XPath">/rss/channel/item/category</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;
run;

/* WordPress feeds return data in pages, 25 entries at a time        */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
  filename feed temp;
  proc http
   method="get"
   url="https://www.dailymaverick.co.za/dmrss?paged=&i."
   out=feed;
  run;
 
  libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
 
  data posts_&i.;
   set result.item;
  run;
  %end;
%mend;
 
%getItems;
 
/* Assemble all pages of entries                       */
/* Convert the date field into a SAS datetime value    */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019      */
data allPosts ;
 set posts_:;
 length sasPubdate 8;
 sasPubdate = input( substr(pubDate,4),anydtdtm.);
 format sasPubdate dtdate9.;
 drop pubDate;
run;

rssfeed.jpg
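With the category column captured, the filtering step might look like the sketch below. It assumes the allPosts data set produced by the revised code, and it uses "South Africa" only because that is the value in the feed example; the value that marks the world section is not confirmed, so the PROC FREQ step lists what is actually present first. The output data set name sectionPosts is just an illustrative choice.

```sas
/* List the category values that actually occur in the fetched items */
proc freq data=allPosts;
  tables category / nocum nopercent;
run;

/* Keep only the items for one category.                          */
/* "SOUTH AFRICA" comes from the feed example; replace it with    */
/* the value for the section you want after checking the listing. */
data sectionPosts;
  set allPosts;
  where upcase(strip(category)) = "SOUTH AFRICA";
run;
```

Comparing on the upper-cased, stripped value avoids missing rows because of differences in case or leading blanks in the feed text.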

kaziumair
Quartz | Level 8
Thanks a lot for your help and guidance
kaziumair
Quartz | Level 8

Hi, how can I scrape a website that has a "load more" button and no RSS feed? Is there a way to call the JavaScript that executes the "load more" functionality from SAS?
