01-10-2012 09:10 PM
I need to extract messy data from a website. Could anyone recommend a good textbook that covers how to extract data from the web efficiently, please?
01-10-2012 10:48 PM
Do you mean automatically or as in copy/paste? If it is the latter, I'll be doing an SGF presentation on the topic in April, titled 'Copy and Paste Almost Anything'. I already presented a draft of the paper at one of my local user group meetings and you can find it at:
01-10-2012 11:05 PM
I meant automatically. For example, I would like to learn a PROC (with many optional statements) that extracts data from an HTML file when I give it the address of a website or the path to an .html file.
Maybe there isn't one? Then I would have to use DATA steps with a lot of @<tag> arguments in the INPUT statement, which would not be very practical.
But thanks for the link. I will have a look; it looks promising.
01-10-2012 11:08 PM
Then you want to look into PROC HTTP and PROC SOAP. Do a search on the discussion forums for either. If you include my ID or FriedEgg's ID in the search, I'm sure that will help to eliminate much of the noise.
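To give a rough idea of the automatic route, here is a minimal sketch using PROC HTTP. It assumes your SAS release includes PROC HTTP; the URL is only a placeholder, and the fileref name resp and the data set name raw_html are made up for illustration. It only downloads the page and reads it back line by line, so pulling out the actual values is still up to you.

```sas
/* Temporary fileref that will hold the downloaded page (name is arbitrary) */
filename resp temp;

/* Fetch the page with a GET request; the URL is only a placeholder */
proc http
   url="http://www.sas.com"
   method="GET"
   out=resp;
run;

/* Read the saved HTML back in, one observation per line of the file */
data raw_html;
   infile resp lrecl=32767 truncover;
   input line $char32767.;
run;
```

From there you can parse raw_html in a DATA step with whatever string functions fit the page layout.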
01-11-2012 12:06 AM
Yes. You can do it.
filename x url 'http://www.sas.com';
data want(where=(line is not missing));
  infile x dsd dlm='<>' lrecl=32767;
  input @ '>' line : $400. @@;
run;