Hi,
Is it possible to use SAS to search the internet?
Suppose I want to Google "used cars". Is it possible to get, say, the first 100 links into a SAS file?
Thank you!
The term is web scraping.
If you search on lexjansen.com there are a bunch of papers with sample code.
Here's an example:
http://support.sas.com/resources/papers/proceedings12/121-2012.pdf
You may also want to check whether there's an API that lets you send a request and get a JSON dataset in return, which is a more structured format.
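As a rough illustration of the API route, here is a minimal sketch using PROC HTTP and the JSON libname engine. It assumes SAS 9.4M4 or later, and the endpoint api.example.com is a placeholder, not a real service; a real API will usually also need a key.

```sas
/* Sketch only: api.example.com is a placeholder endpoint */
filename resp temp;

/* Send the request and save the raw JSON response to a temporary file */
proc http
    url="https://api.example.com/search?q=used+cars"
    method="GET"
    out=resp;
run;

/* The JSON libname engine maps the response onto SAS data sets */
libname results json fileref=resp;

/* List the data sets that the engine created from the response */
proc datasets lib=results;
quit;
```

Which data sets appear depends entirely on the shape of the JSON that the API returns.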
Here's another presentation that describes how to do exactly what you're discussing.
www.oasus.ca/OASUS_20130612_files/3_Scraping_the_Web_with_SAS/3_Scraping_the_Web_with_SAS.ppsx
Hi Tom,
I actually found your presentation and the examples at the OASUS site.
I did example 3 and obtained the distinct addresses that Google found.
I also tried example 4 to get 1000 addresses, but I think it freezes my SAS session because the data is too big. Is this possible?
Thank you!
Good stuff! I'm glad you're partway there.
I had similar things happen to me. I don't think it's a volume issue, as by SAS standards this is all fairly low volume.
I tried it again, but changed the macro loop to
%do i=1 %to 5 %by 5;
to only run the query once. It ran, but took a couple of minutes. I'm wondering if Google has added a "limiter" to slow things down, and prevent people from doing this kind of thing.
All I can suggest using this mechanism is to be patient, and certainly don't try to do 1000 at once.
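One way to "be patient" in code is to fetch a few pages at a time and pause between requests. The sketch below is not the code from the presentation: the macro name fetch_pages, its parameters, the www.example.com address, and the start= parameter are all made up for illustration.

```sas
/* Sketch only: www.example.com and the start= parameter are placeholders */
%macro fetch_pages(first=1, last=21, step=10, pause=5);
    %local i rc;
    %do i = &first %to &last %by &step;
        filename src url "http://www.example.com/results?start=&i";

        /* Read one page of raw HTML, one observation per line */
        data page_&i;
            infile src length=len lrecl=32767;
            input line $varying32767. len;
        run;

        filename src clear;

        /* Wait &pause seconds before the next request (unit 1 = seconds) */
        %let rc = %sysfunc(sleep(&pause, 1));
    %end;
%mend fetch_pages;

%fetch_pages(first=1, last=21, step=10, pause=5)
```

Whether a pause is enough depends on the site; it will not get around a limit that the site enforces.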
Keep in mind, it's Google's world. They only let us live in it, sigh!
Best,
Tom
Hi Tom,
Thanks for the reply. I guess that Google is actually trying to limit such behavior; maybe it's related to making their advertisements more visible...
I would like to ask you another small question, if I may: I have found an example that goes to an employment website and obtains the job postings. In this example the author uses Perl/LWP code. Can this Perl code be run in SAS, or is another program needed?
Thank you!
No, Perl code can't be run inside of SAS. However, if the Perl code is searching or replacing using Regular Expressions, the SAS PRX routines provide much of the same functionality, with pretty much the same syntax.
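To give a feel for how close the syntax is, here is a small sketch with a made-up input string. The Perl substitution shown in the comment is passed to PRXCHANGE essentially unchanged.

```sas
data _null_;
    text = 'Posted: 2013-06-12';

    /* Perl equivalent: $text =~ s/(\d{4})-(\d{2})-(\d{2})/$3\/$2\/$1/; */
    /* The second argument (1) limits the change to the first match */
    new_text = prxchange('s/(\d{4})-(\d{2})-(\d{2})/$3\/$2\/$1/', 1, text);

    put new_text=;
run;
```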
Another option, depending on your SAS environment, is to run Perl using a SAS "X" command, and then acquire the Perl output in SAS.
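A closely related mechanism is a FILENAME PIPE, which runs the command and reads its output directly instead of going through an intermediate file. The sketch below assumes that the XCMD option is enabled in your SAS session and that perl is on the system path; scrape_jobs.pl is a hypothetical script name.

```sas
/* Sketch only: needs XCMD enabled; scrape_jobs.pl is a hypothetical script */
filename plout pipe 'perl scrape_jobs.pl';

/* Each line the script prints to standard output becomes one observation */
data job_postings;
    infile plout truncover lrecl=32767;
    input line $char2000.;
run;

filename plout clear;
```

Many server-based SAS environments have XCMD turned off, in which case neither the X command nor the pipe will work.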
Tom
Actually, now that I think about it, that would be pretty funny. Someone announces "a great new search engine", but all it does is pass the searches to Google, and list the results.
Sorry I didn't think of this sooner...I might have gotten a lot richer than writing SAS code!
Hi Tom,
In your slide in part 4 there is a code line:
prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]- \._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&)/o');
From what I understand, it's there to find the URL. Looking at the code, it searches for "a href", which is the beginning of the URL, but how does SAS know where the URL ends? Or is this different from a regular string, and what SAS is actually doing is searching for the "URL box" in the HTML?
And if that is the case, does it mean that SAS can look for all the different "boxes" of HTML?
thanks!
Regular Expressions are a complex subject. Here's a slide from my presentation, that attempts to describe how the regular expression is composed.
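In short, SAS knows nothing about HTML "boxes"; the pattern only describes text. In the expression quoted above, the `(?<=...)` part is a lookbehind that must appear immediately before the match, the bracketed character class followed by `+` consumes the characters allowed in a URL, and `(?=&)` is a lookahead that requires the next character to be `&`. Since `&` is not in the character class, the match ends there. The sketch below uses a simplified pattern and a made-up HTML string, not the one from the slide, to show the same idea and to pull out every match on a line.

```sas
data urls(keep=url);
    length html $400 url $200;

    /* Made-up sample input containing two result links */
    html = '<h3 class="r"><a href="/url?q=http://www.example.com/cars&sa=U">A</a></h3>'
        || '<h3 class="r"><a href="/url?q=http://www.example.org/autos&sa=U">B</a></h3>';

    /* Lookbehind marks where a URL starts; [^&"]+ takes everything  */
    /* up to the next & or quote; the lookahead insists on an &      */
    prxid = prxparse('/(?<=<a href="\/url\?q=)[^&"]+(?=&)/');

    start = 1;
    stop = length(html);

    /* PRXNEXT returns one match per call and advances start past it */
    call prxnext(prxid, start, stop, html, pos, len);
    do while (pos > 0);
        url = substr(html, pos, len);
        output;
        call prxnext(prxid, start, stop, html, pos, len);
    end;
run;
```

Any other piece of the page can be targeted the same way, as long as the text before and after it is predictable enough to describe in a pattern.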
They provide an interface for valid users to scrape via the APIs.
This prevents things like @TomKari's idea.
Thanks, @Reeza!
Can you point at some documentation about this? I looked for it when I did this, a few years ago, but didn't find anything.
Tom
@TomKari It appears here, under the assumption that you're adding a Search window to your website. It's old...things changed from the last time I attempted this 🙂
There's a very short section on Keeping a Search Result
but it looks deprecated now 😞
https://developers.google.com/web-search/docs/
The new API looks to return XML that also could be parsed.
The old API's documentation doesn't say this is against its terms of use, but I'm not sure about the current one.
Thanks, Reeza
I did a little digging...the results are fascinating.
Turns out:
1) Starting around 2013, Google no longer permits the kind of thing that I described in my presentation.
2) Google had provided an alternative option, which was described in the link you passed on.
3) But then they deprecated it, and replaced it with an option that you need to pay for.
Big surprise!
Tom
Has there been any update on this?
We are currently using the filename syntax to pull driving directions for a series of address sets, subject to daily limitations imposed by Google. There are various sites, SUGI papers, etc. that detail this methodology.
We are looking to use this functionality to retrieve the first page of Google results for some &x &y &z query combination. I landed on this page and found great information here. This was very helpful!
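For building the query from macro variables, a minimal sketch might look like the following. The values of x, y, and z and the www.example.com address are placeholders, and, as noted earlier in the thread, Google itself may not permit this kind of automated retrieval.

```sas
/* Sketch only: placeholder search terms and a placeholder address */
%let x = used;
%let y = cars;
%let z = ottawa;

/* URLENCODE escapes the spaces so the terms are safe in a query string */
%let query = %sysfunc(urlencode(&x &y &z));

filename res url "http://www.example.com/search?q=&query";

/* Read the returned page, one observation per line of HTML */
data first_page;
    infile res length=len lrecl=32767;
    input line $varying32767. len;
run;

filename res clear;
```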