
searching the internet

Super Contributor
Posts: 413

Hi,

 

Is it possible to use SAS to search the internet?

 

Suppose I want to Google "used cars". Is it possible to get, say, the first 100 links into a SAS file?

 

 

Thank you!



All Replies
Super User
Posts: 17,905

Re: searching the internet

The term is web scraping. 

If you search on lexjansen.com there are a bunch of papers with sample code. 

 

Here's an example:

http://support.sas.com/resources/papers/proceedings12/121-2012.pdf

You may also want to check whether there's an API that lets you send a request and get a JSON dataset in return, which is a more structured format.
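As a rough sketch of the API route, assuming SAS 9.4M4 or later (for the JSON libname engine) and a purely hypothetical endpoint at api.example.com, the request and the parsing could look something like this:

```sas
/* Hypothetical endpoint: replace with a real API that returns JSON */
filename resp temp;

proc http
  url="https://api.example.com/search?q=used+cars"
  method="GET"
  out=resp;
run;

/* The JSON libname engine maps the response onto one or more data sets */
libname results json fileref=resp;

/* List the data sets that the engine created from the response */
proc datasets lib=results;
quit;
```

The data set names depend entirely on the shape of the JSON that comes back, so it is worth listing them first before writing any further steps.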

Solution
‎02-13-2016 09:07 PM
PROC Star
Posts: 1,098

Re: searching the internet

Here's another presentation that describes how to do exactly what you're discussing.

 

www.oasus.ca/OASUS_20130612_files/3_Scraping_the_Web_with_SAS/3_Scraping_the_Web_with_SAS.ppsx

Super Contributor
Posts: 413

Re: searching the internet

Hi Tom,

 

I actually found your presentation and the examples at the OASUS site.

 

I did example 3 and obtained the distinct addresses that were found with Google.

 

I also tried example 4 to get 1000 addresses, but I think it freezes my SAS session because the data is too big. Is this possible?

 

 

Thank you!

PROC Star
Posts: 1,098

Re: searching the internet

Good stuff! I'm glad you're partway there.

 

I had similar things happen to me. I don't think it's a volume issue, as by SAS standards this is all fairly low volume.

 

I tried it again, but changed the macro loop to

%do i=1 %to 5 %by 5;

 

to only run the query once. It ran, but took a couple of minutes. I'm wondering if Google has added a "limiter" to slow things down, and prevent people from doing this kind of thing.

 

All I can suggest using this mechanism is to be patient, and certainly don't try to do 1000 at once.
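One way to be patient in code is to pause between requests. This is only a sketch: the macro name and its parameters are made up for illustration, and the request and parsing steps from the presentation are not reproduced here.

```sas
/* Hypothetical macro: runs a few paced iterations instead of 1000 at once */
%macro paced_search(pages=3, delay=10);
  %local i rc;
  %do i = 1 %to &pages;
    %put NOTE: requesting results page &i;
    /* ... the request and parsing steps for page &i would go here ... */

    /* Pause between requests; the second argument 1 means the unit is seconds */
    %if &i < &pages %then %let rc = %sysfunc(sleep(&delay, 1));
  %end;
%mend paced_search;

%paced_search(pages=3, delay=10)
```

Whether a delay is enough to avoid being slowed down is a guess; it just keeps the volume per minute low.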

 

Keep in mind, it's Google's world. They only let us live in it, sigh!

 

Best,

  Tom

 

Super Contributor
Posts: 413

Re: searching the internet

Hi Tom,

 

Thanks for the reply. I guess that Google is actually trying to limit such behavior; maybe it's related to making their advertisements more visible...

 

I would like to ask you another small question if I may: I have found an example that goes to an employment website and obtains the job postings. In this example the author uses Perl/LWP code. Can this Perl code be run in SAS, or is another program needed?

 

 

Thank you!

PROC Star
Posts: 1,098

Re: searching the internet

No, Perl code can't be run inside of SAS. However, if the Perl code is searching or replacing using Regular Expressions, the SAS PRX routines provide much of the same functionality, with pretty much the same syntax.
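To give a feel for how close the syntax is, here is a small sketch with made-up data. The Perl statements in the comments are only rough equivalents for comparison:

```sas
data _null_;
  length title clean $60;
  title = 'Senior   SAS   Programmer (Toronto)';

  /* Perl: $title =~ s/\s+/ /g;  (-1 means replace every occurrence) */
  clean = prxchange('s/\s+/ /', -1, title);

  /* Perl: if ($title =~ /programmer/i) ...  (PRXMATCH returns the match position, 0 if none) */
  is_prog = prxmatch('/programmer/i', title) > 0;

  put clean= is_prog=;
run;
```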

 

Another option, depending on your SAS environment, is to run Perl using a SAS "X" command, and then acquire the Perl output in SAS.
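A pipe fileref is a closely related mechanism that runs the command and reads its output in one go, instead of an X command plus a temporary file. The sketch below assumes that the XCMD system option is enabled, that perl is on the PATH, and that get_jobs.pl is a hypothetical script that prints one result per line:

```sas
/* get_jobs.pl is a hypothetical script name; requires the XCMD option */
filename perlout pipe 'perl get_jobs.pl';

data job_postings;
  /* Each line that the Perl script prints becomes one observation */
  infile perlout truncover;
  input line $char500.;
run;
```

Many server installations disable XCMD, which is why this depends on your environment.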

 

Tom

PROC Star
Posts: 1,098

Re: searching the internet

Actually, now that I think about it, that would be pretty funny. Someone announces "a great new search engine", but all it does is pass the searches to Google, and list the results.

 

Sorry I didn't think of this sooner...I might have gotten a lot richer than writing SAS code!

Super Contributor
Posts: 413

Re: searching the internet

Hi Tom,

In your slide in part 4 there is a code line:

 

prxid=prxparse('/(?<=<h3 class="r"><a   href="\/url\?q=)[[:alnum:]-  \._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');

 

From what I understand, it's meant to find the URL. Looking at the code, it searches for "a href", which is the beginning of the URL, but how does SAS know where the URL ends? Or is this different from a regular string, and what SAS is actually doing is searching for the "url box" in the HTML?

 

And if that is the case, does it mean that SAS can look for all the different "boxes" in the HTML?

 

thanks!

PROC Star
Posts: 1,098

Re: searching the internet

Regular Expressions are a complex subject. Here's a slide from my presentation, that attempts to describe how the regular expression is composed.

 

PRX.jpg
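In short, the expression has three parts: a lookbehind (?<=...) that says what must come right before the URL, a character class followed by + that consumes the URL characters, and a lookahead (?=&amp) that says what must come right after. The lookahead is how the end of the URL is found. Here is a simplified sketch of the same idea, run against a made-up line of HTML rather than a real Google results page:

```sas
data _null_;
  length html $200 url $100;

  /* Made-up sample line; real result pages are laid out differently */
  html = '<h3 class="r"><a href="/url?q=http://www.example.com/cars&amp;sa=U">Used cars</a></h3>';

  /* Lookbehind: preceded by <a href="/url?q=                   */
  /* [^&"]+    : one or more characters that are not & or "     */
  /* Lookahead : followed by &amp, which marks the end of URL   */
  prxid = prxparse('/(?<=<a href="\/url\?q=)[^&"]+(?=&amp)/');

  /* pos and len receive the start and length of the match (0 if none) */
  call prxsubstr(prxid, html, pos, len);
  if pos > 0 then url = substr(html, pos, len);

  put url=;
run;
```

With this sample line the log should show url=http://www.example.com/cars. The same approach works for any other piece of the HTML, as long as you can describe what comes before and after it.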

 

Super User
Posts: 17,905

Re: searching the internet

They provide an interface for valid users to scrape via the APIs. 

This prevents things like @TomKari's idea.

PROC Star
Posts: 1,098

Re: searching the internet

Thanks, @Reeza!

 

Can you point at some documentation about this? I looked for it when I did this, a few years ago, but didn't find anything.

 

Tom

Super User
Posts: 17,905

Re: searching the internet

@TomKari It appears here, under the assumption that you're adding a Search window to your website. It's old... things have changed since the last time I attempted this :)

 

There's a very short section on Keeping a Search Result

but it looks deprecated now :(

https://developers.google.com/web-search/docs/

 

The new API looks to return XML that also could be parsed. 

The old API documentation doesn't indicate that this is against its usage terms, but I'm not sure about the current one.

 

 

 

 

PROC Star
Posts: 1,098

Re: searching the internet

Thanks, Reeza

 

I did a little digging...the results are fascinating.

 

Turns out:

1) Since around 2013, Google no longer permits the kind of thing that I described in my presentation.

2) Google had provided an alternative option, that was described in the link you passed on

3) But then they deprecated it, and replaced it with an option that you need to pay for.

 

Big surprise!

   Tom

Occasional Contributor
Posts: 14

Re: searching the internet

Has there been any update on this?

 

We are currently using the filename syntax to pull driving directions for a series of address sets, subject to daily limitations imposed by Google. There are various sites, SUGI papers, etc. that detail this methodology.
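For readers who have not seen it, the general shape of the filename approach is roughly as follows. This is a sketch with a placeholder address, not our production code, and https addresses need SSL support to be configured in the SAS session:

```sas
/* Placeholder address: substitute the page or service you are querying */
filename src url "https://www.example.com/" lrecl=32767;

data page;
  /* Read the response one line per observation for later parsing */
  infile src truncover;
  input line $char32767.;
run;
```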

 

We are looking to use this functionality to retrieve the first page of Google results for some &x &y &z query combination. I landed on this page and found great information on this. This was very helpful!

 

 

☑ This topic is solved.
