Web crawling a website for webpages that contain words I am interested in finding

Occasional Contributor
Posts: 5

I am trying to run code that crawls all the webpages on a certain website. After the crawl is complete, I want to search the pages it found and see whether the selected words appear in them. If the selected words are found, save the webpage. If the words are not found on the webpage, delete it.
 
The website and words in the program are not final, only a prototype. The real website and word list are more detailed and demanding.
 
 
 
data work.links_to_crawl;
length url $256;
input url $;
datalines;
;
run;
 
 
%macro crawler();
%let html_num = 1;
 
data work.links_crawled;
length url $256;
run;
 
%next_crawl:
/* pop the next url off */
%let next_url = ;
 
data work.links_to_crawl;
set work.links_to_crawl;
if _n_ eq 1 then call symput("next_url", url);
else output;
run;
 
%let next_url = %trim(%left(&next_url));
 
%if "&next_url" ne "" %then %do;
 
%put crawling &next_url ... ;
 
/* crawl the url */
filename _nexturl url "&next_url";
 
/* put the file we crawled here */
filename htmlfile "file%trim(&html_num).html";
 
/* find more urls */
data work._urls(keep=url);
length url $256 ;
file htmlfile;
infile _nexturl length=len;
input text $varying2000. len;
 
put text;
 
start = 1;
stop = length(text);
 
if _n_ = 1 then do;
retain patternID;
pattern = '/href="([^"]+)"/i';
patternID = prxparse(pattern);
end;
 
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(patternID, start, stop, text, position, length);
do while (position ^= 0);
url = substr(text, position+6, length-7);
* put url=;
output;
call prxnext(patternID, start, stop, text, position, length);
end;
run;
 
/* add the current link to the list of urls we have already crawled */
data work._old_link;
url = "&next_url";
run;
proc append base=work.links_crawled data=work._old_link;
run;
 
/* only add urls that we haven't already crawled or that aren't queued up to be crawled */
proc sql noprint;
create table work._append as
select url
from work._urls
where url not in (select url from work.links_crawled)
and url not in (select url from work.links_to_crawl);
quit;
 
/* only add urls that are absolute (http://...) */
data work._append;
set work._append;
absolute_url = substrn(url, 1, 7);
put absolute_url=;
if absolute_url eq "http://" ;
drop absolute_url;
run;
 
/* add new links */
proc append base=work.links_to_crawl data=work._append force;
run;
 
/* increment our file number */
%let html_num = %eval(&html_num + 1);
 
/* loop */
%goto next_crawl;
%end;
 
%mend crawler;
 
%crawler();
data crawler();
*length chapter $200;
* infile eSUG length=len lrecl=32767;
*input line $varying32767. len;
uline= upcase(line);
if find(uline,"example") and find(uline,"dunbar") or find(uline,"principal")or find(uline,"doctor")
or find(uline,"DR")or find(uline,"football")or find(uline,"math")or find(uline,"english")
then do;
chapter=scan(line,2,'"');
output;
end;
run;
 
Super User
Posts: 22,874


Is there a question here or are you sharing code?

Occasional Contributor
Posts: 5


The end part of the program, the word search at the webpage level, does not work. I was asking what I need to do to make it work. I apologize for not stating that in the initial post.

Super User
Posts: 22,874


That last data step isn't pointed at a file or anything, and you've commented out the infile statements, so it has no input to read.
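For illustration, here is a minimal sketch of what pointing that step at one saved page might look like. The macro name check_page, the fileref page and the two keywords are hypothetical; the file name only assumes the file1.html, file2.html, ... naming the crawler macro already uses and that the files sit in the current working directory. It is one possible arrangement, not a tested replacement for the original step.

```sas
/* Sketch only: check_page, the fileref "page" and the keywords are  */
/* hypothetical; file&n..html follows the crawler's naming scheme.   */
%macro check_page(n);
    %local hit rc;
    %let hit = 0;

    /* point a fileref at the saved page so the DATA step has input */
    filename page "file&n..html";

    data _null_;
        infile page lrecl=32767 truncover end=last;
        input line $char32767.;
        retain found 0;
        /* the 'i' modifier makes FIND ignore case */
        if find(line, 'principal', 'i') or find(line, 'football', 'i')
            then found = 1;
        /* on the last line, hand the result to the macro layer */
        if last then call symputx('hit', found);
    run;

    /* no keyword anywhere in the page: delete the saved file */
    %if &hit = 0 %then %do;
        %let rc = %sysfunc(fdelete(page));
        %put NOTE: file&n..html removed, FDELETE rc=&rc;
    %end;

    filename page clear;
%mend check_page;

%check_page(1)
```

Calling the macro once per saved file number would cover every page the crawler wrote. If a file is missing or empty, hit stays 0, so the sketch treats it as having no match.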

 

And what does 'not work' mean?

 

You should also look at FINDW instead of FIND, and at the third parameter, which can make the search ignore case. "FIND", "find" and "Find" are three different values.
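A small sketch of the difference, using a made-up test string (the sentence and variable names are assumptions for illustration only). It compares a case-sensitive FIND, a FIND with the 'i' modifier, and FINDW, which takes a delimiter list before the modifier argument.

```sas
data _null_;
    /* hypothetical test string */
    line = 'The address is 12 Drury Lane, near the Math wing';

    /* 0: the text contains "Math", not "math" */
    case_sensitive = find(line, 'math');
    /* nonzero: the 'i' modifier ignores case */
    ignore_case    = find(line, 'math', 'i');

    /* FIND matches substrings, so "dr" is found inside "address" */
    as_substring   = find(line, 'dr', 'i');
    /* FINDW matches whole words only; 0 here, no standalone "dr" */
    as_word        = findw(line, 'dr', ' ,.', 'i');

    put case_sensitive= ignore_case= as_substring= as_word=;
run;
```

This also shows why upcasing the line and then searching for lowercase words never matches. When combining several tests, note that SAS evaluates AND before OR, so parentheses make the intended grouping explicit.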
