SAS access to Google/Yahoo search engines to collect URL lists

Occasional Contributor
Posts: 6

Dear SAS support communities,

I tried to use Base SAS to collect URL lists from a search engine but failed. If I have a keyword "SAS+training" and the URL "http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining", can I collect the result links? It would be very helpful, especially since I have more than 100 keywords to search.

Thank you in advance!

Tiffany

Trusted Advisor
Posts: 3,214

What did you try?

What were the issues?

Combining these two references should give you an approach:

SAS(R) 9.4 Statements: Reference (FILENAME Statement, URL Access Method)

SAS(R) 9.4 Statements: Reference (input statement, example 5, multiple files)

Macros to the rescue.
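A rough, untested sketch of that combination, looping a macro over a keyword list. The dataset name keywords, the fileref srch, the output names page_1, page_2, ... and the simplified search URL are all assumptions for illustration, and the keywords are assumed to contain only letters, digits and spaces. Whether the search engine returns useful results for such a request is a separate question.

```sas
/* Hypothetical keyword list; in practice this could hold 100+ rows. */
data keywords;
  infile datalines truncover;
  input keyword $char50.;
  datalines;
sas training
sas certification
;
run;

/* Store each keyword in macro variables KW1..KWn, with spaces replaced by +. */
data _null_;
  set keywords end=last;
  call symputx(cats('kw', _n_), translate(strip(keyword), '+', ' '));
  if last then call symputx('nkw', _n_);
run;

%macro fetch_all;
  %local i;
  %do i = 1 %to &nkw;
    /* Double quotes are needed here so that &&kw&i resolves. */
    /* The URL pattern is an assumption, not a documented endpoint. */
    filename srch url "http://www.google.com/search?q=&&kw&i" lrecl=32767;

    /* Keep every returned record together with the query that produced it. */
    data page_&i;
      length query $50 record $32767;
      query = "&&kw&i";
      infile srch lrecl=32767;
      input;
      record = _infile_;
    run;

    filename srch clear;
  %end;
%mend fetch_all;

%fetch_all
```

Each keyword gets its own dataset, so a failed request for one keyword does not affect the others.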

---->-- ja karman --<-----
Occasional Contributor
Posts: 6

My goal is to collect the result links for a Google keyword search programmatically, instead of clicking and browsing. I can successfully read data from a static website, like example 3 below.

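filename foo url 'http://support.sas.com/techsup/service_intro.html';

data _null_;
  infile foo length=len;          /* len receives the length of each record */
  input record $varying200. len;
  put record $varying200. len;    /* echo the record to the log */
  if _n_=15 then stop;            /* stop at the 15th record */
run;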

However, if I put the URL of a Google search page in the FILENAME statement, I can't get the result links.

filename foo url "http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining";

data _null_;
  infile foo length=len;
  input record $varying200. len;
  put record $varying200. len;
  if _n_=15 then stop;
run;

Thank you for your help.

Trusted Advisor
Posts: 3,214

Tiffany, am I seeing both ' (single) and " (double) quotes being used around the strings?

Single quotes (') keep the text untouched by the SAS macro processor.

Double quotes (") let the SAS macro processor change the string as it likes, acting on & and % characters (and possibly more).

Changing from " to ' should help. For simple testing, just stream the records to the log, something like this (untested):

filename foo url 'http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining';

data _null_;
  infile foo length=len lrecl=32767;  /* raise LRECL; the default maximum length is often only 255 */
  input;
  put _infile_;                       /* _INFILE_ is an automatic variable containing the whole record */
run;

---->-- ja karman --<-----
Occasional Contributor
Posts: 6

Jaap,

I got 4 records that are not related to the search results. The data looks like the Google homepage; I couldn't find "SAS" or "training" anywhere in the output. I tried Yahoo, with the same result.

Your help will be greatly appreciated.

Trusted Advisor
Posts: 3,214

The good thing is that you are getting a response, from the Japanese domain of Google.

How to build the web address line is documented at Google. You seem to have copied one generated by a web page; the %2B is the encoded replacement for the + sign, probably done by the browser.

Search Protocol Reference (Google: http:\\  developers.google.com/search-appliance/documentation/46/xml_reference#request_format ). I broke up the address on purpose so that it is shown.
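To make the structure of the copied address visible, here is a small sketch that only takes the string apart; it makes no web request. The variable names are my own. Under the usual HTTP rules, everything after the # is a fragment that is handled on the client side, so it is worth checking on which side of the # the q= parameter ends up. How the SAS URL access method treats a fragment is something I have not verified.

```sas
data _null_;
  length url before_hash after_hash $300 decoded $50;

  /* Single quotes keep & and % away from the macro processor. */
  url = 'http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining';

  before_hash = scan(url, 1, '#');   /* everything up to the first #       */
  after_hash  = scan(url, 2, '#');   /* the fragment that follows the #    */

  /* URLDECODE turns %nn escape sequences back into characters. */
  decoded = urldecode('sas%2Btraining');

  put before_hash= / after_hash= / decoded=;
run;
```

The log then shows which parameters sit in the query string and which sit only in the fragment.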

Are you sure you changed the quotes to the single quote ' (the unshifted one on a US keyboard layout)?

Did you increase the record length (LRECL) to 32767? With just 4 records, there must be a lot more data.

-- HTTP does not know or use the record conventions common on Windows/Unix; the page is CSS and HTML, possibly with some scripting used for layout.
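A quick way to check the record length question is to count the records and track the longest one. This is an untested sketch using the static page from earlier in the thread; the names nrec and maxlen are my own. If maxlen comes out equal to the LRECL value, the records are probably being cut or split at that length.

```sas
filename foo url 'http://support.sas.com/techsup/service_intro.html' lrecl=32767;

data _null_;
  infile foo lrecl=32767 length=len end=eof;
  input;                          /* read one record into _INFILE_          */
  retain maxlen 0;
  nrec + 1;                       /* running count of records               */
  maxlen = max(maxlen, len);      /* longest record seen so far             */
  if eof then put nrec= maxlen=;  /* report once, after the last record     */
run;
```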

Message was edited by: Jaap Karman (added reclen note)

---->-- ja karman --<-----
Occasional Contributor
Posts: 6

Jaap, thank you so much. It's helpful!

Super Contributor
Posts: 297

Hi Tiffany,

Could you please post back when you get a successful outcome? I would be very interested to see how things progress.

Regards,

Scott

Trusted Advisor
Posts: 3,214

And Scott went on... https://communities.sas.com/message/178205#178205

---->-- ja karman --<-----
Occasional Contributor
Posts: 6

Hi Scott and Jaap,

I can extract data from an ordinary web page, but I still fail to extract data from search engines like Google and Yahoo.

Trusted Advisor
Posts: 1,301

Even though search engines may be the largest web scraper applications, they do not take too kindly to being scraped themselves.  Recommended reading:

Scraping Google for Fun and Profit

The process followed by this site could be replicated in SAS without major headaches. To move you along the path, without any of the recommended items for using proxies, your URL should resemble the following:

filename i url 'http://www.google.com/search?q=sas+training&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&clien...

It is the addition of the browser information to the GET request that returns actual results, rather than the request being interrupted by Google's first, minor layer of protection against scraping.
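Once a request does return a results page, the links still have to be pulled out of the HTML. Below is an untested sketch of one way to do that with a Perl regular expression. The fileref srch, the dataset links and the shortened URL are assumptions; the full parameter list above is cut off, so only the q= parameter is used here, and there is no guarantee that the search engine answers such a request with results.

```sas
/* Assumption: a shortened search URL with only the q= parameter. */
filename srch url 'http://www.google.com/search?q=sas+training' lrecl=32767;

data links (keep=link);
  length link $2000;
  retain re;
  /* Compile the pattern once: capture whatever sits inside href="...". */
  if _n_ = 1 then re = prxparse('/href="([^"]+)"/i');

  infile srch lrecl=32767 length=len;
  input;                              /* current record is now in _INFILE_ */

  start = 1;
  stop  = len;
  /* Walk through every match in the record, one output row per link. */
  call prxnext(re, start, stop, _infile_, pos, mlen);
  do while (pos > 0);
    link = prxposn(re, 1, _infile_);
    output;
    call prxnext(re, start, stop, _infile_, pos, mlen);
  end;
run;
```

This collects every href on the page, navigation links included, so the result still needs filtering. A link that happens to be split across two records would be missed.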

Trusted Advisor
Posts: 3,214

FriedEgg makes a good point. Using this kind of facility to spam or scam will get you nowhere.

At the same time, his link shows that it should be technically possible to build that kind of functionality.
As long as you stick to normal behavior, with usage like that of a real person, it should be possible to get it to work.

I wish I had a SAS installation available privately to investigate, as I have no access to SAS at the moment.

I have always found it too expensive for home use, given how much time I already spend with it at work.

---->-- ja karman --<-----