02-13-2016 01:05 AM - edited 02-13-2016 01:05 AM
Is it possible to use SAS to search the internet?
Suppose I want to Google "used cars". Is it possible to get, say, the first 100 links into a SAS file?
02-13-2016 01:13 AM
The term is web scraping.
If you search on lexjansen.com there are a bunch of papers with sample code.
Here's an example:
You may also want to check whether there's an API that lets you send a request and get a JSON dataset in return, which is a more structured format.
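As a rough sketch of the API route, assuming SAS 9.4M4 or later (needed for the JSON libname engine) and a service that returns JSON, something like the following might work. The endpoint https://api.example.com/search, its q= parameter, and the names resp and api are placeholders made up for illustration, not a real service.

```sas
/* Sketch only: the endpoint and its q= parameter are hypothetical. */
filename resp temp;

/* Send the request and store the raw response in the temp file. */
proc http
  url="https://api.example.com/search?q=used+cars"
  method="GET"
  out=resp;
run;

/* The JSON libname engine (SAS 9.4M4+) maps the response to data sets. */
libname api json fileref=resp;

/* List whatever data sets the engine created from the response. */
proc datasets lib=api;
quit;
```

The data sets that appear depend entirely on the shape of the JSON returned, so the listing step is there to see what the engine produced before writing any further code against it.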
02-13-2016 03:22 PM
I actually found your presentation and the examples at the OASUS site.
I ran example 3 and obtained the distinct addresses that were found with Google.
I also tried example 4 to get 1000 addresses, but SAS seems to freeze. Could that be because the data is too big?
02-13-2016 04:39 PM
Good stuff! I'm glad you're partway there.
I had similar things happen to me. I don't think it's a volume issue, as by SAS standards this is all fairly low volume.
I tried it again, but changed the macro loop to
%do i=1 %to 5 %by 5;
to only run the query once. It ran, but took a couple of minutes. I'm wondering if Google has added a "limiter" to slow things down, and prevent people from doing this kind of thing.
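To see why that loop header runs the body only once, here is a minimal sketch with the query replaced by a %put statement. The macro name demo is made up for illustration; the real macro from the presentation is not reproduced here.

```sas
%macro demo;
  /* i starts at 1; the next value would be 6, which exceeds 5, so the loop stops. */
  %do i=1 %to 5 %by 5;
    %put NOTE: running query for i=&i;
  %end;
%mend demo;

%demo
```

Raising the upper bound in steps (for example %to 10, then %to 15) is one cautious way to add requests gradually instead of asking for everything at once.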
All I can suggest using this mechanism is to be patient, and certainly don't try to do 1000 at once.
Keep in mind, it's Google's world. They only let us live in it, sigh!
02-13-2016 06:10 PM
Thanks for the reply. I guess that Google is actually trying to limit such behavior; maybe it's related to making their advertisements more visible...
I would like to ask you another small question, if I may: I have found an example that goes to an employment website and obtains the job postings. In this example the author uses Perl/LWP code. Can this Perl code be run in SAS, or is another program needed?
02-13-2016 06:23 PM
No, Perl code can't be run inside of SAS. However, if the Perl code is searching or replacing using regular expressions, the SAS PRX routines provide much of the same functionality, with pretty much the same syntax.
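As an illustration of how close the syntax is, here is a small sketch with the rough Perl equivalents in the comments. The sample text is invented for the example.

```sas
data _null_;
  length text cleaned $100;
  text = 'Used   cars for   sale';

  /* Perl: $text =~ s/\s+/ /g;   In SAS, -1 means "replace every match". */
  cleaned = prxchange('s/\s+/ /', -1, strip(text));

  /* Perl: if ($text =~ /cars/i) { ... }                               */
  /* PRXMATCH returns the position of the first match, or 0 if none.   */
  found = prxmatch('/cars/i', text);

  put cleaned= found=;
run;
```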
Another option, depending on your SAS environment, is to run Perl using a SAS "X" command, and then acquire the Perl output in SAS.
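A sketch of that second option follows. Note that it uses the FILENAME PIPE device rather than a literal X statement, because the pipe lets a data step read the Perl output directly. It assumes the XCMD system option is enabled and that Perl is installed on the machine running SAS; scrape.pl is a hypothetical script assumed to print one link per line.

```sas
/* Sketch only: scrape.pl is a hypothetical script, not part of this thread. */
/* Requires the XCMD system option and Perl on the machine running SAS.      */
filename perlout pipe 'perl scrape.pl';

data links;
  /* Each line the script prints becomes one observation. */
  infile perlout truncover lrecl=2000;
  input link $char2000.;
run;
```

In locked-down environments (many server and SAS Studio setups) XCMD is turned off, in which case neither the pipe nor the X statement will work.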
02-13-2016 06:25 PM
Actually, now that I think about it, that would be pretty funny. Someone announces "a great new search engine", but all it does is pass the searches to Google, and list the results.
Sorry I didn't think of this sooner...I might have gotten a lot richer than writing SAS code!
02-13-2016 07:16 PM
In your slide in part 4 there is a code line:
prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]- \._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&)/o');
From what I understand, its purpose is to find the URL. Looking at the code, it searches for "a href", which marks the beginning of the URL, but how does SAS know where the URL ends? Or is this different from a regular string search, and what SAS is actually doing is searching for the "URL box" in the HTML?
And if that is the case, does it mean that SAS can look for all the different "boxes" of HTML?
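Reading the pattern itself, it is an ordinary string match rather than anything HTML-aware. The (?<=...) part is a lookbehind: the text just before the match must be the h3/a href prefix, but that prefix is not returned. The bracketed character class followed by + then consumes characters that are allowed in a URL, and the (?=&) lookahead requires the next character to be an ampersand. So the URL "ends" at the first & that follows it. A minimal sketch, using an invented line of HTML shaped like the one the pattern expects:

```sas
data _null_;
  length line $300 url $200;

  /* Hypothetical line of result HTML, made up for illustration. */
  line = '<h3 class="r"><a href="/url?q=http://www.example.com/cars&sa=U">Used cars</a></h3>';

  /* Same pattern as in the slide. */
  prxid = prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]- \._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&)/o');

  /* PRXSUBSTR returns the start and length of the match (pos is 0 if none). */
  call prxsubstr(prxid, line, pos, len);

  if pos > 0 then do;
    url = substr(line, pos, len);
    put url=;
  end;
run;
```

The same idea works for other pieces of a page: change the lookbehind and lookahead to whatever text surrounds the piece you want. It only holds up as long as the site keeps producing exactly that markup.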
02-14-2016 01:13 PM
@TomKari It appears here, under the assumption that you're adding a search window to your website. It's old... things have changed since the last time I attempted this.
There's a very short section on Keeping a Search Result,
but it looks deprecated now.
The new API appears to return XML, which could also be parsed.
The old API doesn't indicate that this is against its usage terms, but I'm not sure about the current one.
02-15-2016 09:12 PM
I did a little digging...the results are fascinating.
1) Since around 2013, Google no longer permits the kind of thing that I described in my presentation.
2) Google had provided an alternative option, which was described in the link you passed on.
3) But then they deprecated it, and replaced it with an option that you need to pay for.
03-21-2017 01:00 PM
Has there been any update on this?
We are currently using the filename syntax to pull driving directions for a series of address sets, subject to the daily limits imposed by Google. There are various sites, SUGI papers, etc. that detail this methodology.
We are looking to use this functionality to retrieve the first page of Google results for some &x &y &z query combination. I landed on this page and found a lot of great information here. This was very helpful!
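For readers who have not seen it, the "filename syntax" here is presumably the FILENAME statement's URL access method. A minimal sketch of that general pattern follows; the address is a placeholder, not the directions service, and the data set and variable names are made up for illustration.

```sas
/* Sketch only: the URL is a placeholder, not the directions service. */
filename src url "http://www.example.com/";

data page;
  /* Read the response one line at a time into a character variable. */
  infile src lrecl=32767 truncover;
  input html $char32767.;
run;

filename src clear;
```

Each line of the returned page becomes one observation, which can then be searched with the PRX functions discussed earlier in the thread.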