Dear SAS support communities,
I tried to use Base SAS to collect lists of URLs from a search engine but failed. If I have the keyword "SAS+training" and the URL "http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining", can I collect the result links? It would be very helpful, especially if I have more than 100 keywords to search.
Thank you in advance!
Tiffany
What did you try?
What were the issues?
Combining:
SAS(R) 9.4 Statements: Reference ( FILENAME Statement, URL Access Method )
SAS(R) 9.4 Statements: Reference ( input statement example 5 multiple files)
That should give you an approach. Macros to the rescue.
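A minimal sketch of how those two references might be combined with a macro loop. The macro name fetch_pages, the base URL, and the keyword list are all hypothetical placeholders, and this has not been run against any real search service:

```sas
/* Hypothetical macro: fetch one page per keyword and print its first lines. */
/* base= and keywords= are illustrative placeholders, not tested endpoints.  */
%macro fetch_pages(base=, keywords=);
  %local i kw;
  /* Loop over the blank-separated keyword list */
  %do i = 1 %to %sysfunc(countw(&keywords, %str( )));
    %let kw = %scan(&keywords, &i, %str( ));
    /* Double quotes are needed here so that &base and &kw resolve */
    filename src url "&base.&kw" lrecl=32767;
    data _null_;
      infile src length=len;
      input;
      put _infile_;            /* write the raw record to the log */
      if _n_ = 15 then stop;   /* only show the first 15 records  */
    run;
    filename src clear;
  %end;
%mend fetch_pages;

%fetch_pages(base=%str(http://www.example.com/search?q=), keywords=sas training macro);
```

Each keyword here is a single word; a multi-word search term would first have to be URL-encoded (for example with a + between the words) before being appended to the base URL.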
My purpose is to collect the result links after searching for a keyword on Google, instead of clicking and browsing. I can successfully collect data from a static website, as in example 3 below.
filename foo url
'http://support.sas.com/techsup/service_intro.html';
data _null_;
infile foo length=len;
input record $varying200. len;
put record $varying200. len;
if _n_=15 then stop;
run;
However, if I put a Google search page after url, I can't get the result links.
filename foo url
"http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining";
data _null_;
infile foo length=len;
input record $varying200. len;
put record $varying200. len;
if _n_=15 then stop;
run;
Thank you for your help.
Tiffany, am I seeing both ' (single) and " (double) quotes being used around the strings?
Single quotes (') keep the text untouched by the SAS macro processor.
Double quotes (") let the SAS macro processor change the string as it likes, acting on & and % characters (and possibly more).
Changing from " to ' should help. For simple testing of the stream (untested):
filename foo url 'http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining' ;
data _null_ ;
infile foo length=len lrecl=32767 ; /* check the lrecl setting, as the default maximum length is often 255 */
input; put _infile_ ; /* _infile_ is an automatic variable containing the whole record */
run ;
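A small sketch of the quoting difference described above. The macro variable q and the example.com address are made up for the illustration:

```sas
/* Hypothetical macro variable that happens to share a name with a URL parameter */
%let q = REPLACED;

data _null_;
  /* Single quotes: the macro processor leaves &q alone */
  single = 'http://www.example.com/?x=1&q=sas';
  /* Double quotes: &q is resolved before the DATA step sees the string */
  double = "http://www.example.com/?x=1&q=sas";
  put single=;
  put double=;
run;
```

With q defined, the double-quoted value becomes http://www.example.com/?x=1REPLACED=sas. When no such macro variable exists, the text is normally left unchanged and SAS writes an "Apparent symbolic reference ... not resolved" warning to the log, so the quotes matter most when a URL parameter name collides with an existing macro variable.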
Jaap,
I got 4 observations, which are not related to the search results. The data looks like the Google home page. I couldn't find any "SAS" or "training" in the output. I tried Yahoo, with the same result.
Your help will be greatly appreciated.
The good thing is that you are getting a response, from the Japanese domain of Google.
How to build the web address line is documented by Google. You appear to have copied one generated by a web page. The %2B is the replacement for the + sign, probably done by the browser.
Search Protocol Reference (Google: http:\\ developers.google.com/search-appliance/documentation/46/xml_reference#request_format ). I deliberately broke the address so that it would be shown.
Are you sure you changed the quotes to the single ' (the unshifted one on a US keyboard layout)?
Did you increase the record length (lrecl) to 32767? With just 4 records, there must be a lot more data.
-- HTTP does not know or use the common record lengths of Windows/Unix; it is using CSS and HTML, possibly with some scripting being used for layout.
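One more thing that may explain the home-page response: in the copied address, q=sas%2Btraining sits after a # character. That part of a URL is the fragment, which a browser keeps for its own scripts and does not send to the server as part of the request. A rough sketch that splits the address to show what the server would actually be asked for (the variable names are only illustrative, and how SAS's URL access method treats the fragment internally is an assumption here):

```sas
data _null_;
  length url request fragment $200;
  /* Single quotes so the macro processor leaves & and % alone */
  url = 'http://www.google.co.jp/?gws_rd=cr#bav=on.2,or.&fp=4d59ced19cf73e3&q=sas%2Btraining';
  request  = scan(url, 1, '#');   /* the part that reaches the server      */
  fragment = scan(url, 2, '#');   /* the part that stays on the client side */
  put request=;
  put fragment=;
run;
```

The request part is just http://www.google.co.jp/?gws_rd=cr, with no search term in it. A query the server can see has to appear after the ? and before any #.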
Message was edited by: Jaap Karman (added reclen note)
Jaap, thank you so much. It's helpful!
Hi Tiffany,
Can you please post when you have a successful outcome? I would be very interested to see how things progress.
Regards,
Scott
And Scott went on... https://communities.sas.com/message/178205#178205
Hi Scott and Jaap,
I can extract data from a web page but still fail to extract data from search engines like Google and Yahoo.
Even though search engines may be the largest web scraper applications, they do not take too kindly to being scraped themselves. Recommended reading:
Scraping Google for Fun and Profit
The process followed by this site could be replicated in SAS without major headaches. To move you along the path, without any of the recommended items for using proxies, your URL should resemble the following:
filename i url 'http://www.google.com/search?q=sas+training&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&cli...
It is the addition of the browser information to the GET method that will return actual results rather than being interrupted by Google's first minor layer of protection from scraping.
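A related but different way to pass browser information is a User-Agent request header instead of extra URL parameters. The sketch below is not the method described above; it assumes SAS 9.4M3 or later (for the HEADERS statement of PROC HTTP), and the URL and header value are placeholders:

```sas
/* Temporary file to hold the response body */
filename resp temp;

proc http
  url="http://www.example.com/search?q=sas+training"   /* placeholder URL */
  method="GET"
  out=resp;
  /* Hypothetical browser identification string */
  headers "User-Agent"="Mozilla/5.0";
run;

/* Print the first records of the response to the log */
data _null_;
  infile resp lrecl=32767;
  input;
  put _infile_;
  if _n_ = 15 then stop;
run;
```

Whether a given search engine accepts such a request is up to that service and its terms of use.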
FriedEgg has a good argument. Using this kind of facility for spam or scams will bring nothing.
At the same time, his link says it should be technically possible to build that kind of functionality.
As long as you stick to normal behavior, with normal usage like a real person, it should be possible to get it to work.
I wish I had a private SAS installation available to investigate, as I have no access to SAS at the moment.
I always found it too expensive for home use, given how much time I already spend with it at work.