ilikesas
Barite | Level 11

Hi,

 

Is it possible to use SAS to search the internet?

 

Suppose I want to google "used cars". Is it possible to get, say, the first 100 links into a SAS file?

 

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
TomKari
Onyx | Level 15

Here's another presentation that describes how to do exactly what you're discussing.

 

www.oasus.ca/OASUS_20130612_files/3_Scraping_the_Web_with_SAS/3_Scraping_the_Web_with_SAS.ppsx


16 REPLIES
Reeza
Super User

The term is web scraping. 

If you search on lexjansen.com there are a bunch of papers with sample code. 

 

Here's an example:

http://support.sas.com/resources/papers/proceedings12/121-2012.pdf

You may also want to check whether there's an API that lets you send a request and get a JSON response in return, which is a more structured format than raw HTML.
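
As a rough sketch of the API route, assuming SAS 9.4M4 or later (for the JSON libname engine) and a purely hypothetical endpoint at api.example.com, the request and the parsing could look something like this. The URL, the fileref `resp`, and the libref `results` are placeholders, not a real service:

```sas
/* Hypothetical endpoint: replace with a real API you are permitted to use */
filename resp temp;

proc http
  url="https://api.example.com/search?q=used+cars"
  method="GET"
  out=resp;
run;

/* The JSON libname engine maps the response to one or more data sets */
libname results json fileref=resp;

/* List the data sets that the engine derived from the JSON structure */
proc datasets lib=results;
quit;
```

Which data sets appear depends entirely on the shape of the JSON that the service returns, so it is worth inspecting the library before writing any further code against it.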

TomKari
Onyx | Level 15

Here's another presentation that describes how to do exactly what you're discussing.

 

www.oasus.ca/OASUS_20130612_files/3_Scraping_the_Web_with_SAS/3_Scraping_the_Web_with_SAS.ppsx

ilikesas
Barite | Level 11

Hi Tom,

 

I actually found your presentation and the examples at the OASUS site.

 

I did example 3 and obtained the distinct addresses that Google returned.

 

I also tried example 4 to get 1000 addresses, but SAS freezes. I think it may be because the data is too big. Is that possible?

 

 

Thank you!

TomKari
Onyx | Level 15

Good stuff! I'm glad you're partway there.

 

I had similar things happen to me. I don't think it's a volume issue, as by SAS standards this is all fairly low volume.

 

I tried it again, but changed the macro loop to

%do i=1 %to 5 %by 5;

 

to only run the query once. It ran, but took a couple of minutes. I'm wondering if Google has added a "limiter" to slow things down, and prevent people from doing this kind of thing.

 

All I can suggest using this mechanism is to be patient, and certainly don't try to do 1000 at once.
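
For what it's worth, the reason the changed loop runs only once is that `i` starts at 1 and the next value, 6, is already past the upper bound of 5. A minimal sketch of a paced loop is below. The macro name `fetch_pages` and its parameters are made up for illustration, and the actual page request from the presentation is left as a placeholder comment:

```sas
%macro fetch_pages(first=1, last=5, step=5, pause=5);
  %local i;
  %do i=&first %to &last %by &step;
    %put NOTE: requesting results starting at &i;

    /* ... the FILENAME URL / DATA step for one page would go here ... */

    /* Wait &pause seconds before the next request (the unit of 1 means seconds) */
    data _null_;
      call sleep(&pause, 1);
    run;
  %end;
%mend fetch_pages;

/* With first=1, last=5, step=5 the body executes once, for i=1 */
%fetch_pages(first=1, last=5, step=5)
```

Pausing between requests will not get around any limits the site imposes. It only keeps the job from hammering the server.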

 

Keep in mind, it's Google's world. They only let us live in it, sigh!

 

Best,

  Tom

 

ilikesas
Barite | Level 11

Hi Tom,

 

Thanks for the reply. I guess that Google is actually trying to limit such behavior. Maybe it's related to making their advertisements more visible...

 

I would like to ask you another small question, if I may: I have found an example that goes to an employment website and obtains the job postings. In this example the author uses Perl/LWP code. Can this Perl code be run in SAS, or is another program needed?

 

 

Thank you!

TomKari
Onyx | Level 15

No, Perl code can't be run inside of SAS. However, if the Perl code is searching or replacing using Regular Expressions, the SAS PRX routines provide much of the same functionality, with pretty much the same syntax.

 

Another option, depending on your SAS environment, is to run Perl using a SAS "X" command, and then acquire the Perl output in SAS.
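
Both options can be sketched roughly as follows. The sample text and the script name `scrape_jobs.pl` are invented for illustration. The second step uses a FILENAME PIPE rather than the X statement itself, because the pipe hands the Perl output straight to a DATA step. It assumes that the XCMD option is enabled and that `perl` is on the system path:

```sas
/* 1) A Perl-style substitution done with a SAS PRX function */
data _null_;
  length text fixed $60;
  text  = 'Posted 2015-03-14 by recruiter';
  /* Same s/// syntax as Perl: reorder yyyy-mm-dd into dd.mm.yyyy */
  fixed = prxchange('s/(\d{4})-(\d{2})-(\d{2})/$3.$2.$1/', 1, text);
  put fixed=;
run;

/* 2) Run an external Perl script and read whatever it prints */
/*    (hypothetical script name; requires XCMD and perl on the PATH) */
filename plout pipe 'perl scrape_jobs.pl';

data jobs;
  infile plout truncover lrecl=32767;
  input line $char2000.;
run;
```

The first step should write the reordered date to the log. The second step simply stores each line of the script's output as one observation, ready for further parsing.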

 

Tom

TomKari
Onyx | Level 15

Actually, now that I think about it, that would be pretty funny. Someone announces "a great new search engine", but all it does is pass the searches to Google, and list the results.

 

Sorry I didn't think of this sooner...I might have gotten a lot richer than writing SAS code!

ilikesas
Barite | Level 11

Hi Tom,

In your slide in part 4 there is a code line:

 

prxid=prxparse('/(?<=<h3 class="r"><a   href="\/url\?q=)[[:alnum:]-  \._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');

 

From what I understand, it's meant to find the URL. Looking at the code, it searches for "a href", which is the beginning of the URL, but how does SAS know where the URL ends? Or is this different from a regular string, and what SAS is actually doing is searching for the "url box" in the HTML?

 

And if that is the case, does it mean that SAS can look for all the different "boxes" of HTML?

 

thanks!

TomKari
Onyx | Level 15

Regular Expressions are a complex subject. Here's a slide from my presentation, that attempts to describe how the regular expression is composed.

 

[Image: PRX.jpg, the slide describing how the regular expression is composed]
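
To make the idea concrete, here is a simplified sketch of the same pattern, not the exact expression from the slide. SAS does not understand HTML "boxes" at all. The pattern just has three parts: a lookbehind that says "the match must come right after this prefix", a character class repeated with `+` that consumes the URL itself, and a lookahead that says "stop where `&amp` follows". The sample line is made up to imitate the old result markup:

```sas
data urls;
  length line $300 url $200;
  /* Made-up line imitating the old Google result markup */
  line = '<h3 class="r"><a href="/url?q=http://www.example.com/cars&amp;sa=U">Used cars</a></h3>';

  /* (?<=...) prefix that must precede the match, without being part of it */
  /* [^&"]+   one or more characters that are neither & nor a double quote  */
  /* (?=&amp) the match ends where &amp begins, without consuming it        */
  prxid = prxparse('/(?<=<a href="\/url\?q=)[^&"]+(?=&amp)/');

  /* PRXSUBSTR returns the position and length of the matched text */
  call prxsubstr(prxid, line, pos, len);
  if pos > 0 then url = substr(line, pos, len);
  put url=;

  drop prxid pos len;
run;
```

So the end of the URL is found purely by text matching: the character class cannot cross an ampersand, and the lookahead requires `&amp` immediately after the match. The same approach works for any other piece of the page, as long as you can describe the text that surrounds it.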

 

Reeza
Super User

They provide an interface for valid users to scrape via the APIs. 

This prevents things like @TomKari's idea. 

TomKari
Onyx | Level 15

Thanks, @Reeza!

 

Can you point at some documentation about this? I looked for it when I did this, a few years ago, but didn't find anything.

 

Tom

Reeza
Super User

@TomKari It appears here, under the assumption that you're adding a Search window to your website. It's old...things changed from the last time I attempted this 🙂

 

There's a very short section on Keeping a Search Result

but it looks deprecated now 😞

https://developers.google.com/web-search/docs/

 

The new API looks to return XML that also could be parsed. 

The old API doesn't indicate this is against usage but I'm not sure about the current one.

 

 

 

 

TomKari
Onyx | Level 15

Thanks, Reeza

 

I did a little digging...the results are fascinating.

 

Turns out:

1) Starting around 2013, Google no longer permits the kind of thing that I described in my presentation.

2) Google had provided an alternative option, which was described in the link you passed on.

3) But then they deprecated it, and replaced it with an option that you need to pay for.

 

Big surprise!

   Tom

Data_Detective_23219
Calcite | Level 5

Has there been any update on this?

 

We are currently using the FILENAME syntax to pull driving directions for a series of address sets, subject to daily limitations imposed by Google. There are various sites, SUGI papers, etc. that detail this methodology.
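
For readers who have not seen it, the FILENAME approach mentioned here generally looks something like the sketch below. The address is a placeholder, not the actual directions service, and the fileref `src` and data set `page` are arbitrary names:

```sas
/* Placeholder address: substitute the page or service you are allowed to query */
filename src url "http://www.example.com/";

/* Read the returned page one line per observation */
data page;
  infile src lrecl=32767 truncover;
  input line $char32767.;
run;

filename src clear;
```

Each observation in `page` then holds one line of the response, which can be searched with the PRX functions discussed earlier in the thread.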

 

We are looking to use this functionality to retrieve the first page of Google results for some &x &y &z query combination. I landed on this page and I'm seeing great information on this. This was very helpful!

 

 
