Hi everyone, I am trying to scrape a website that has a "Load more" button. Initially the site shows around 20 articles; to see more, the user has to press the "Load more" button or scroll down.
Is there any way to bypass this and scrape the website?
This is the website I am trying to scrape:
https://www.dailymaverick.co.za/section/world/
I just want to extract the main headlines from the articles.
Thanks.
For this site, it looks like you might be able to get an RSS feed (XML):
https://www.dailymaverick.co.za/dmrss/
I've shared how to fetch/parse RSS feeds in this article.
For fun, I applied the technique to this source:
/* Copyright SAS Institute Inc. */
filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
<NAMESPACES count="0"/>
<!-- ############################################################ -->
<TABLE name="item">
<TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
<COLUMN name="title">
<PATH syntax="XPath">/rss/channel/item/title</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>250</LENGTH>
</COLUMN>
<COLUMN name="link">
<PATH syntax="XPath">/rss/channel/item/link</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>200</LENGTH>
</COLUMN>
<COLUMN name="pubDate">
<PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>40</LENGTH>
</COLUMN>
</TABLE>
</SXLEMAP>
;
run;
/* WordPress feeds return data in pages, 25 entries at a time */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
    filename feed temp;
    proc http
      method="get"
      url="https://www.dailymaverick.co.za/dmrss?paged=&i."
      out=feed;
    run;
    libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
    data posts_&i.;
      set result.item;
    run;
  %end;
%mend;
%getItems;
/* Assemble all pages of entries */
/* Convert the date field into a SAS datetime value, displayed as a date */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019 */
data allPosts;
  set posts_:;
  length sasPubdate 8;
  sasPubdate = input(substr(pubDate, 4), anydtdtm.);
  format sasPubdate dtdate9.;
  drop pubDate;
run;
Result: [image not preserved]
This appears to be a WordPress site, so separate feeds for individual categories might be available. Finding them will take some exploration, though; so far I've seen only the one main feed.
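One way to do that exploration is to request a candidate feed URL and check the HTTP status that comes back. The sketch below assumes SAS 9.4M5 or later, where PROC HTTP sets the SYS_PROCHTTP_STATUS_CODE macro variable. The candidate URL is only a guess based on common WordPress feed patterns, not a confirmed endpoint on this site.

```sas
/* Hypothetical candidate URL: a guess, not a confirmed feed on this site */
%let candidate = https://www.dailymaverick.co.za/section/world/feed/;

filename probe temp;
proc http
  method="get"
  url="&candidate."
  out=probe;
run;

/* SYS_PROCHTTP_STATUS_CODE is set by PROC HTTP (SAS 9.4M5 and later) */
%put NOTE: &candidate. returned HTTP status &SYS_PROCHTTP_STATUS_CODE.;
```

A 200 status only shows that the URL responds. You would still need to look at the contents of the probe file to confirm that it is RSS and not an HTML page.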
The RSS feed does have a category field. Example:
<category>South Africa</category>
You could use this in your SAS process to filter items from the data after you fetch it.
Revised code to capture the category:
filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
<NAMESPACES count="0"/>
<!-- ############################################################ -->
<TABLE name="item">
<TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
<COLUMN name="title">
<PATH syntax="XPath">/rss/channel/item/title</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>250</LENGTH>
</COLUMN>
<COLUMN name="link">
<PATH syntax="XPath">/rss/channel/item/link</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>200</LENGTH>
</COLUMN>
<COLUMN name="pubDate">
<PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>40</LENGTH>
</COLUMN>
<COLUMN name="category">
<PATH syntax="XPath">/rss/channel/item/category</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>40</LENGTH>
</COLUMN>
</TABLE>
</SXLEMAP>
;
run;
/* WordPress feeds return data in pages, 25 entries at a time */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
    filename feed temp;
    proc http
      method="get"
      url="https://www.dailymaverick.co.za/dmrss?paged=&i."
      out=feed;
    run;
    libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
    data posts_&i.;
      set result.item;
    run;
  %end;
%mend;
%getItems;
/* Assemble all pages of entries */
/* Convert the date field into a SAS datetime value, displayed as a date */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019 */
data allPosts;
  set posts_:;
  length sasPubdate 8;
  sasPubdate = input(substr(pubDate, 4), anydtdtm.);
  format sasPubdate dtdate9.;
  drop pubDate;
run;
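With the category column captured, the filtering step might look like the sketch below. It assumes the allPosts data set built by the code above; the value "World" and the data set name worldHeadlines are assumptions for illustration, so check which category values the feed actually returns first.

```sas
/* List the category values that actually came back from the feed */
proc freq data=allPosts;
  tables category / nocum nopercent;
run;

/* Keep only one category; "WORLD" is an assumed value, adjust as needed */
data worldHeadlines;
  set allPosts;
  where upcase(category) = "WORLD";
  keep title category sasPubdate;
run;

/* Print just the headlines */
proc print data=worldHeadlines noobs;
  var title;
run;
```

If an item carries several category elements, this map may capture only one of them, so a filter on a single value could miss some articles.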
Hi, how can I scrape a website that has a "Load more" button and no RSS feed? Is there a way to call the JavaScript that executes the "load more" functionality from SAS?