04-03-2017 09:49 PM
I am trying to extract data from the Securities and Exchange Commission.
I wrote some code that takes me directly to a search results page, and from there I want to extract the links to other data.
filename link url "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=o&type=DEF+14&dateb=&owner=exclude&count=10";
data web;
  infile link length=len lrecl=32767;
  input line $varying32767. len;
  p = find(line, 'a href="Archives/edgar/data/');
  if p then output;
run;
The page at that link has 10 "Documents" buttons, and it is their links that I am trying to extract. But I get the error message "the message received was unexpected or badly formatted". Is there a way to remedy this?
04-03-2017 09:58 PM
SAS is probably the last tool I'd use for web scraping.
Import.io is a free, easy-to-use tool.
Selenium is another free tool, though slightly more difficult to use.
If you're on a Mac the built in Automator has several examples.
04-05-2017 04:05 PM - edited 04-05-2017 04:07 PM
I noticed on your SEC link that right at the top of the results table there is an RSS link. This is essentially XML-formatted data, which is a bit easier to read into SAS. The example below uses the RSS link and passes in &count=100 to get more data:
/* temp location for XML data */
filename resp temp;

/* get request from sec api */
proc http
  url="https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000726728&CIK=0000726728&type=DEF%2014%25&dateb=&owner=exclude&start=0&count=100&output=atom"
  method="GET"
  out=resp;
run;

/* use automap with XML libname engine */
filename tempMap temp;
libname sec xmlv2 xmlfileref=resp xmlmap=tempMap automap=replace;

/* copy data to work to view more details */
proc copy in=sec out=work;
run;
You can peruse the output files in WORK to pull out the pieces you may need (or join with the other tables). A lot more information on using XML in SAS is here: http://support.sas.com/rnd/base/xmlengine/ (my example just scratches the surface)
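If you would rather stay close to the line-scanning approach from the original question, here is a rough sketch that scans the same response for filing links. It assumes the fileref resp from the PROC HTTP step above is still assigned, and that the feed carries the links as href="..." attributes containing Archives/edgar/data/ (I have not checked every element the feed uses). The data set name filing_links and the variable href are just names I made up for the example.

```sas
/* Sketch only: assumes fileref RESP still holds the response from PROC HTTP. */
/* FILING_LINKS and HREF are illustrative names, not anything SEC-defined.    */
data filing_links;
  length href $500;
  retain re;
  infile resp length=len lrecl=32767 truncover;
  input line $varying32767. len;

  /* compile the pattern once; group 1 captures the value of the href attribute */
  if _n_ = 1 then
    re = prxparse('/href="([^"]*Archives\/edgar\/data\/[^"]*)"/');

  /* a single line may hold several links, so walk through every match */
  if len > 0 then do;
    start = 1;
    stop  = len;
    call prxnext(re, start, stop, line, pos, mlen);
    do while (pos > 0);
      href = prxposn(re, 1, line);
      output;
      call prxnext(re, start, stop, line, pos, mlen);
    end;
  end;

  keep href;
run;
```

Unlike the FIND version, which outputs the whole matching line, this writes one row per link, so a PROC PRINT of filing_links should show just the URLs if the assumption about the feed holds.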
Also, a web search on "SEC API" points to other sources as well.
Hope this helps.