Tiffany1
Calcite | Level 5

Dear SAS support communities,

I tried to use Base SAS to collect lists of URLs from a search engine but failed. If I have a keyword "SAS+training" and the URL "http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining", can I collect the result links? It would be very helpful, especially since I have more than 100 keywords to search.

Thank you in advance!

Tiffany

11 REPLIES
jakarman
Barite | Level 11

What did you try?

What were the issues?

Combining:

SAS(R) 9.4 Statements: Reference ( FILENAME Statement, URL Access Method )

SAS(R) 9.4 Statements: Reference ( input statement  example 5 multiple files)

should give you an approach. Macros to the rescue.

---->-- ja karman --<-----
Tiffany1
Calcite | Level 5

My purpose is to collect the result links returned by a Google keyword search, instead of clicking and browsing. I can successfully collect data from a static website, as in example 3 below.

filename foo url
  'http://support.sas.com/techsup/service_intro.html';

data _null_;
  infile foo length=len;
  input record $varying200. len;
  put record $varying200. len;
  if _n_=15 then stop;
run;

However, if I put the URL of a Google search page in the FILENAME statement, I can't get the result links.

filename foo url
  "http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining";

data _null_;
  infile foo length=len;
  input record $varying200. len;
  put record $varying200. len;
  if _n_=15 then stop;
run;

Thank you for your help.

jakarman
Barite | Level 11

Tiffany, am I seeing both ' (single) and " (double) quotes being used around the URL?

Single quotes (') keep the text untouched by the SAS macro processor.

Double quotes (") let the SAS macro processor change the string as it likes, acting on & and % characters (and possibly more).

Changing from " to ' should help. For a simple test that just streams the response to the log (untested):

filename foo url 'http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining';

data _null_;
  infile foo length=len lrecl=32767;  /* raise lrecl: the default maximum record length is often too short */
  input;                              /* read the next record without assigning variables */
  put _infile_;                       /* _infile_ is an automatic variable containing the whole record */
run;
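
The quoting difference described above can be seen without any network access. The following is only a minimal sketch: it assumes a macro variable named q happens to be defined in the session, and both the %let and the example.com address are illustrative and not taken from the thread.

```sas
/* Hypothetical setup: a macro variable whose name collides with a URL parameter */
%let q = REPLACED;

data _null_;
  length url_single url_double $ 80;
  /* Single quotes: the macro processor does not scan the literal */
  url_single = 'http://www.example.com/search?hl=en&q=sas';
  /* Double quotes: &q is resolved before the DATA step is compiled */
  url_double = "http://www.example.com/search?hl=en&q=sas";
  put url_single= / url_double=;
run;
```

If no macro variable with a matching name exists, SAS leaves the double-quoted text unchanged and only writes an "Apparent symbolic reference ... not resolved" warning to the log, so the quote style alone may not explain a missing search result.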

---->-- ja karman --<-----
Tiffany1
Calcite | Level 5

Jaap,

I got 4 observations, which are not related to the search results. The data looks like the Google homepage. I couldn't find any "SAS" or "training" in the output. I tried Yahoo, with the same result.

Your help will be greatly appreciated.

jakarman
Barite | Level 11

The good thing is that you are getting a response, from the Japanese domain of Google.

How to build the web address line is documented by Google. You appear to have copied one generated by a web page. The %2B is the replacement for the + sign, probably done by the browser.

Search Protocol Reference (Google: http:\\  developers.google.com/search-appliance/documentation/46/xml_reference#request_format ). I deliberately broke the address so that it would be displayed.
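
To illustrate the encoding point: %2B is the percent-encoded form of a literal plus sign, whereas a space in a keyword needs its own escape. The sketch below uses the Base SAS URLENCODE and URLDECODE functions; the keyword value is an illustrative assumption.

```sas
data _null_;
  length keyword encoded decoded $ 200;
  keyword = 'sas training';               /* illustrative keyword containing a space */
  encoded = urlencode(strip(keyword));    /* STRIP avoids encoding trailing blanks */
  decoded = urldecode('sas%2Btraining');  /* %2B is the escape sequence for a plus sign */
  put encoded= / decoded=;
run;
```

Comparing the two values in the log shows whether a copied address searches for the two words or for the literal text with a plus sign in it.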

Are you sure you changed the quotes to the plain single quote ' (US keyboard layout)?

Did you increase the record length (lrecl) to 32767? With just 4 records, there must be a lot more data.

-- HTTP does not know or use the common record lengths of Windows/Unix; the response is HTML and CSS, possibly with some scripting used for layout.

Message was edited by: Jaap Karman (added reclen note)

---->-- ja karman --<-----
Tiffany1
Calcite | Level 5

Jaap, thank you so much. It's helpful!

Scott_Mitchell
Quartz | Level 8

Hi Tiffany,

Can you please post when you get a successful outcome? I would be very interested to see how things progress.

Regards,

Scott

Tiffany1
Calcite | Level 5

Hi Scott and Jaap,

I can extract data from a web page, but I still fail to extract data from search engines like Google and Yahoo.

FriedEgg
SAS Employee

Even though search engines may be the largest web scraper applications, they do not take too kindly to being scraped themselves.  Recommended reading:

Scraping Google for Fun and Profit

The process followed by this site could be replicated in SAS without major headaches. To move you along the path without any of the recommended items for using proxies, your URL should resemble the following:

filename i url 'http://www.google.com/search?q=sas+training&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&cli...

It is the addition of the browser information to the GET request that will return actual results, rather than the request being interrupted by Google's first minor layer of protection from scraping.
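
For the case of 100 or more keywords, one possible approach is sketched below. It is untested against a live search engine and rests on several assumptions: the keyword list is illustrative, the browser parameters from the truncated URL above are left out because they are not fully shown, and the search engine may still block or reshape the response. One detail matters regardless: anything after a # in a URL is a fragment that the client does not send to the server, so the keyword has to go in the query string (?q=...), as in the /search address above. The sketch uses the INFILE FILEVAR= option with the URL device type so that each observation opens a different address.

```sas
/* Illustrative keyword list; in practice this could hold 100+ rows */
data keywords;
  infile datalines truncover;
  input keyword $100.;
  datalines;
sas training
sas certification
;
run;

data responses;
  set keywords;
  length target $ 500 line $ 32767;
  /* The query goes in the query string, never after a # fragment */
  target = cats('http://www.google.com/search?q=', urlencode(strip(keyword)));
  /* FILEVAR= opens a new URL each time the value of TARGET changes */
  infile dummy url filevar=target lrecl=32767 truncover end=done;
  do while (not done);
    input line $char32767.;
    output;                /* one observation per response record */
  end;
  call sleep(2, 1);        /* pause 2 seconds between requests */
run;
```

The RESPONSES data set then holds the raw HTML for each keyword; extracting the result links from it would be a separate parsing step.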

jakarman
Barite | Level 11

FriedEgg has a good argument. Using these kinds of facilities for spam or scams will get you nowhere.

At the same time, his link shows that it should be technically possible to build that kind of functionality.
As long as you stick to normal usage and behave like a real person, it should be possible to get it to work.

I wish I had a private SAS installation available to investigate, as I have no access to SAS at the moment.

I always found it too expensive for home use, given how much time I spend with it at work.

---->-- ja karman --<-----

