Scrape data from a webpage

Contributor
Posts: 44

How can I use SAS to save the titles shown in the 'US - Based Outbreak' container on this page?

https://www.cdc.gov/outbreaks/index.html

(EDIT via Reeza to fix the link)


Accepted Solutions
Solution
‎09-13-2017 11:14 AM
Community Manager
Posts: 3,384

Re: Scrape data from a webpage

I noticed that the CDC offers a lot of RSS feeds -- XML representations of the data on their site.

 

In SAS, you can use PROC HTTP to fetch the XML and then the XMLV2 LIBNAME engine to read that information as data.

 

You'll have to find the proper RSS feed for your needs.  They have many of them listed here.  Here's a working example with one of their feeds.

 

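/* Write an XMLMap to a temporary file. The map tells the XMLV2 engine */
/* how to turn each RSS <item> element into a row.                     */
filename rssmap temp;
data _null_;
  infile datalines;
  file rssmap;
  input;
  put _infile_;
datalines;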
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="item">
        <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
        <COLUMN name="title">
            <PATH syntax="XPath">/rss/channel/item/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>250</LENGTH>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/rss/channel/item/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>200</LENGTH>
        </COLUMN>
        <COLUMN name="pubDate">
            <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;
run;


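/* Download the RSS feed into a temporary file. */
filename feed temp;
proc http
  method="get"
  url="https://www2c.cdc.gov/podcasts/createrss.asp?t=r&c=429."
  out=feed;
run;

/* Read the downloaded XML through the map defined above. */
libname result XMLv2 xmlfileref=feed xmlmap=rssmap;

/* Convert the character pubDate into a numeric SAS datetime. */
data bulletins;
  set result.item;
  length date 8;
  format date datetime20.;
  /* Skip the three-letter weekday at the start of pubDate before parsing. */
  date = input(substr(pubDate, 4), anydtdtm.);
  drop pubDate;
run;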

 

Result:

 

rssfeed.png
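
The code above reads whatever PROC HTTP wrote to the feed fileref, so a failed request may only show up later as an XML parsing problem. A minimal sketch of a status check, assuming SAS 9.4M5 or later (where PROC HTTP sets the SYS_PROCHTTP_STATUS_CODE and SYS_PROCHTTP_STATUS_PHRASE macro variables). The macro name check_feed_status is made up for illustration and is not part of the original solution:

```sas
/* Hypothetical helper: report the outcome of the most recent PROC HTTP call. */
/* Assumes SAS 9.4M5+, which sets SYS_PROCHTTP_STATUS_CODE and                */
/* SYS_PROCHTTP_STATUS_PHRASE after each PROC HTTP step.                      */
%macro check_feed_status;
  %if not %symexist(SYS_PROCHTTP_STATUS_CODE) %then
    %put WARNING: No PROC HTTP status available. Run PROC HTTP first.;
  %else %if &SYS_PROCHTTP_STATUS_CODE ne 200 %then
    %put ERROR: Request failed: &SYS_PROCHTTP_STATUS_CODE &SYS_PROCHTTP_STATUS_PHRASE;
  %else
    %put NOTE: Feed downloaded with HTTP status &SYS_PROCHTTP_STATUS_CODE;
%mend check_feed_status;

/* Call this right after the PROC HTTP step and before the LIBNAME statement. */
%check_feed_status
```

On older releases, the headerout= option of PROC HTTP is one alternative way to capture the response status.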


All Replies
Valued Guide
Posts: 576

Re: Scrape data from a webpage

I get "Page not Found" when I click on your link. In any case, SAS isn't a web scraping tool; I'd use something like the Python library Beautiful Soup for that.

Contributor
Posts: 44

Re: Scrape data from a webpage

Posted in reply to ChrisBrooks

I don't know why the link doesn't open for you; I checked it again and it works. I wanted to try it in SAS so that I could integrate it with my existing SAS reports. Thank you for the response.

Community Manager
Posts: 3,384

Re: Scrape data from a webpage

I think @Reeza edited the post and fixed the link for you -- that's why it works now. 

 

And I found the RSS feed you need for that category:

 

 


filename feed temp;
proc http
method="get"
url="https://tools.cdc.gov/api/v2/resources/media/285676.rss"
out=feed;
run;

 

feed2.png
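
To get from that feed to the titles the original question asks for, the XMLV2 step from the accepted solution can be reused. This is a sketch only: it assumes the rssmap fileref from the accepted solution is still assigned in the session and that this feed follows the same /rss/channel/item layout, which has not been verified here. The data set name outbreak_titles is illustrative:

```sas
/* Assumes: fileref FEED holds the downloaded RSS (PROC HTTP step above), */
/* and fileref RSSMAP holds the XMLMap from the accepted solution.        */
libname result XMLv2 xmlfileref=feed xmlmap=rssmap;

/* Keep only the columns needed for a list of outbreak titles. */
data outbreak_titles;
  set result.item;
  keep title link;
run;

/* Print the titles to confirm the feed was read as expected. */
proc print data=outbreak_titles noobs;
run;

/* Release the XML libref once the data has been copied to WORK. */
libname result clear;
```

Copying the rows into a WORK data set first means later report steps do not depend on the temporary XML file.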

 

Contributor
Posts: 44

Re: Scrape data from a webpage

Posted in reply to ChrisHemedinger

That explains it. Thank you all.

Contributor
Posts: 44

Re: Scrape data from a webpage

Posted in reply to ChrisHemedinger

Thank you. Is there any published documentation you would recommend I review?

Super Contributor
Posts: 285

Re: Scrape data from a webpage

If you are comfortable with R, there is an R package called 'rvest', developed by Hadley Wickham.

New Contributor
Posts: 3

Re: Scrape data from a webpage

Posted in reply to SAS_inquisitive

filename rssmap temp;
data _null_;
infile datalines;
file rssmap;
input;
put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="item">
        <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
        <COLUMN name="title">
            <PATH syntax="XPath">/rss/channel/item/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>250</LENGTH>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/rss/channel/item/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>200</LENGTH>
        </COLUMN>
        <COLUMN name="pubDate">
            <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>40</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;
run;


filename feed temp;
proc http
 method="get"
 url="https://nps.magicbricks.com/npsScript/nps.js?1.337"
 out=feed;
run;

libname result XMLv2 xmlfileref=feed xmlmap=rssmap;

data bulletins;
 set result.item;
 length date 8;
 format date datetime20.;
 date = input( substr(pubDate,4),anydtdtm.);
 drop pubDate;
run;

 

I get an XML error with this code.

Community Manager
Posts: 3,384

Re: Scrape data from a webpage

@Pranjal - what are you trying to get from this "page"? The URL you supplied is a JavaScript file, not XML. Please post the details of what you need in a new question rather than adding to this solved topic.
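
One crude way to catch this kind of mismatch before the XMLV2 engine reports an error is to look at the start of the downloaded file. The sketch below assumes the feed fileref has already been filled by PROC HTTP. It only checks whether the first non-blank line begins with "<", so it is a rough heuristic rather than real XML validation:

```sas
/* Peek at the first non-blank line of the downloaded response. */
data _null_;
  infile feed lrecl=32767 truncover;
  input line $char200.;
  if not missing(line) then do;
    /* The =: operator compares only the leading character here. */
    if left(line) =: '<' then
      put 'NOTE: Response starts with "<", so it may be XML.';
    else
      put 'WARNING: Response does not look like XML: ' line $char60.;
    stop;
  end;
run;
```

A JavaScript or plain-text response would trigger the warning, which is a hint to look for an RSS or XML version of the data instead.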

🔒 This topic is solved and locked.
