DATA Step, Macro, Functions and more

Parsing HTML with SAS

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 92
Accepted Solution

Parsing HTML with SAS

Hello,

I'm trying to use SAS to acquire data from the internet.  More specifically, I'm interested in financial data on publicly traded companies.  I have code which downloads the financial information.  This part works fine.  The issue is that the data may span across several observations.  My plan to is tag all the observations with relevant data and then use proc transpose to combine the multiple observations into one observation where I can use nested if statements and substrate/index functions to extract the information.  I'm attempting to use a DO UNTIL loop to mark the records with relevant information, but can't get it to work properly.  My program crashes whenever the data spans across more than one observation.  Please see my code below.  Does anyone have an idea of what I'm doing wrong?  Any help or advice would be greatly appreciated.  Thanks so much!


Regards,

Bill

*DOWNLOAD FINANCIAL DATA;

filename in url 'http://finance.yahoo.com/q/bs?s=AAPL' debug lrecl=8192;

DATA OUT.YAHOO_HTML;

infile in length=len;

input record $varying8192. len;

RUN;

*PARSE HTML;

%MACRO BAL_SHEET_ITEM(YAHOO_NAME,SAS_NAME);

DATA OUT.&SAS_NAME.1;

SET OUT.YAHOO_HTML;

SYMBOL='AAPL';

*LIMIT DATA SET TO RECORDS WITH RELEVANT INFO;

KEEPDATA = 0;

%LET BEG_DATA=INDEX(record,&YAHOO_NAME.); /* BEGINNING OF DATA*/

%LET END_DATA=index(record,' </td></tr><tr>');/*END OF DATA*/

IF &BEG_DATA. > 0 THEN DO UNTIL (&END_DATA.>0);

KEEPDATA=1;

END;

IF KEEPDATA=1;

RUN;

%MEND;

%BAL_SHEET_ITEM('Cash And Cash Equivalents',CASH_AND_EQUIV);

%BAL_SHEET_ITEM('Inventory',INVENTORY);

%BAL_SHEET_ITEM('Short Term Investments',SHORT_TERM_ASSETS);

%BAL_SHEET_ITEM('Net Receivables',NET_REC);

%BAL_SHEET_ITEM('Inventory',INVENTORY);

%BAL_SHEET_ITEM('Other Current Assets',OTH_CURR_ASSETS);

%BAL_SHEET_ITEM('Accounts Payable',ACCTS_PAYABLE);

%BAL_SHEET_ITEM('Short/Current Long Term Debt',SHORT_CURR_LT_DEBT);

%BAL_SHEET_ITEM('Other Current Liabilities',OTH_CURR_LIAB);

%BAL_SHEET_ITEM('Long Term Debt',LT_DEBT);

%BAL_SHEET_ITEM('Other Liabilities',OTH_LIAB);

%BAL_SHEET_ITEM('Deferred Long Term Liability Charges',DEF_LT_LIAB);

%BAL_SHEET_ITEM('Minority Interest',MIN_INT);

%BAL_SHEET_ITEM('Negative Goodwill',NEG_GOODWILL);


Accepted Solutions
Solution
‎05-18-2013 02:37 PM
Super User
Super User
Posts: 6,500

Re: Parsing HTML with SAS

Why not let SAS input statement handle it for you instead of stuffing the html into a character variable?

For example to get the column headers you could use something like:

data headers ;

infile in flowover ;

input @'Period Ending' @;

array datestr(4) $12 ;

do col=1 to 4;

   input @ '<b>'  datestr(col) $12. @;

end;

run;

If the problem is that key text fields like "Period Ending" is flowed across rows then perhaps you can pull in the file and write it back out without the end of line breaks.

You can begin to automate it by making the SYMBOL used in the URL a macro variable.

For the numbers you might need to first read in the value as a character string and then parse it to eliminate '&nbsp' or other HTML codes.

%let symbol=AAPL;

filename in url "http://finance.yahoo.com:80/q/bs?s=&symbol" lrecl=8192;

data cash ;

infile in flowover ;

symbol="&symbol";

input @'Cash And Cash Equivalents' @;

array cash (4) ;

do col=1 to 4;

   input @ '<td align="right">'  string $12. @;

   cash(col)=input(scan(string,1,'&<'),??comma12.);

end;

output;

stop;

drop col string;

run;

data _null_;

set ;

put (_all_) (=);

run;

symbol=AAPL cash1=12053000 cash2=10746000 cash3=27084000 cash4=18383000

View solution in original post


All Replies
Solution
‎05-18-2013 02:37 PM
Super User
Super User
Posts: 6,500

Re: Parsing HTML with SAS

Why not let SAS input statement handle it for you instead of stuffing the html into a character variable?

For example to get the column headers you could use something like:

data headers ;

infile in flowover ;

input @'Period Ending' @;

array datestr(4) $12 ;

do col=1 to 4;

   input @ '<b>'  datestr(col) $12. @;

end;

run;

If the problem is that key text fields like "Period Ending" is flowed across rows then perhaps you can pull in the file and write it back out without the end of line breaks.

You can begin to automate it by making the SYMBOL used in the URL a macro variable.

For the numbers you might need to first read in the value as a character string and then parse it to eliminate '&nbsp' or other HTML codes.

%let symbol=AAPL;

filename in url "http://finance.yahoo.com:80/q/bs?s=&symbol" lrecl=8192;

data cash ;

infile in flowover ;

symbol="&symbol";

input @'Cash And Cash Equivalents' @;

array cash (4) ;

do col=1 to 4;

   input @ '<td align="right">'  string $12. @;

   cash(col)=input(scan(string,1,'&<'),??comma12.);

end;

output;

stop;

drop col string;

run;

data _null_;

set ;

put (_all_) (=);

run;

symbol=AAPL cash1=12053000 cash2=10746000 cash3=27084000 cash4=18383000

Frequent Contributor
Posts: 92

Re: Parsing HTML with SAS

Tom,

Wow, thank you so much.  The code works great.  Your solution is way more efficient than what I had in mind.  Evidently, I must have sustained a severe head injury prior to thinking about this problem???  I'm busy modifying the code to combine and then transpose the data with the purpose of creating one table that has the symbol, balance sheet items, and financial statement dates.

Thanks again for your assistance.

-Bill

Frequent Contributor
Posts: 92

Re: Parsing HTML with SAS

Quick follow-up question: one of the balance sheet items (Long Term Debt) is completely contained in another balance sheet item (Short/Current Long Term Debt).  I want both of these variables.  Whenever I run the code the code below, I cannot download the values for 'Long Term Debt' since 'Short/Current Long Term Debt' comes first in the HTML.  The code pulls in the values for 'Short/Current Long Term Debt' instead of 'Long Term Debt'.  Does anyone have any suggestions on how to handle this type of situation?  Is there an option in the INPUT statement that will allow me to select the 2nd instance of a phrase?  Any help would be greatly appreciated.


Thank you!

Bill


%let symbol=AAPL;

filename in url "http://finance.yahoo.com:80/q/bs?s=&symbol" lrecl=8192;

data cash ;

infile in flowover ;

symbol="&symbol";

input @'Long Term Debt' @;

array cash (4) ;

do col=1 to 4;

   input @ '<td align="right">'  string $12. @;

   cash(col)=input(scan(string,1,'&<'),??comma12.);

end;

output;

stop;

drop col string;

run;

Valued Guide
Posts: 2,175

Re: Parsing HTML with SAS

Repeat that INPUT @'Long Term ' before loading the variables

Super User
Super User
Posts: 6,500

Re: Parsing HTML with SAS

Be more precise in what you search for?  I see this in the source for AAPL.

>Long Term Debt<

>Short/Current Long Term Debt<


So if you search for them with he leading > (from the <td> tag) then it will be unique.

Frequent Contributor
Posts: 92

Re: Parsing HTML with SAS

Thanks for the comments. 

Tom K. - Good to know.  For the webpage in question, the record length of 8192 appears to be working fine.  I'll keep your suggestion in mind when I start to branch out and look at other webpages.

Tom - I like your solution.  Currently I use an IF statement to determine when to repeat the input command.  I'll change my code to be more precise in my search which should improve execution.

Cheers,

Bill

Frequent Contributor
Posts: 92

Re: Parsing HTML with SAS

Peter,

Thank you for the suggestion.  It solved the problem.

Regards,

Bill

PROC Star
Posts: 1,093

Re: Parsing HTML with SAS

One other caution. It is definitely possible for HTML to exceed 8192 bytes. Whenever I pull it in, I stuff it into a 32767 character field, and then reduce it once I know what the max record length is.

Tom (the other Tom, that is!)

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 1536 views
  • 0 likes
  • 4 in conversation