kevin12
Fluorite | Level 6
I am trying to run code that crawls all the webpages on a certain website. After the crawl is complete, I want to search the pages it found and check whether selected words appear on them. If the selected words are found on a page, the page should be saved. If they are not found, the page should be deleted.
 
The website and words in the program below are a prototype, not the final ones. The final website and word list are more detailed and demanding.
 
 
 
data work.links_to_crawl;
length url $256;
input url $;
datalines;
;
run;
 
 
%macro crawler();
%let html_num = 1;
 
data work.links_crawled;
length url $256;
run;
 
%next_crawl:
/* pop the next url off */
%let next_url = ;
 
data work.links_to_crawl;
set work.links_to_crawl;
if _n_ eq 1 then call symput("next_url", url);
else output;
run;
 
%let next_url = %trim(%left(&next_url));
 
%if "&next_url" ne "" %then %do;
 
%put crawling &next_url ... ;
 
/* crawl the url */
filename _nexturl url "&next_url";
 
/* put the file we crawled here */
filename htmlfile "file%trim(&html_num).html";
 
/* find more urls */
data work._urls(keep=url);
length url $256 ;
file htmlfile;
infile _nexturl length=len;
input text $varying2000. len;
 
put text;
 
start = 1;
stop = length(text);
 
if _n_ = 1 then do;
retain patternID;
pattern = '/href="([^"]+)"/i';
patternID = prxparse(pattern);
end;
 
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(patternID, start, stop, text, position, length);
do while (position ^= 0);
url = substr(text, position+6, length-7);
* put url=;
output;
call prxnext(patternID, start, stop, text, position, length);
end;
run;
 
/* add the current link to the list of urls we have already crawled */
data work._old_link;
url = "&next_url";
run;
proc append base=work.links_crawled data=work._old_link;
run;
 
/* only add urls that we haven't already crawled or that aren't queued up to be crawled */
proc sql noprint;
create table work._append as
select url
from work._urls
where url not in (select url from work.links_crawled)
and url not in (select url from work.links_to_crawl);
quit;
 
/* only add urls that are absolute (http://...) */
data work._append;
set work._append;
absolute_url = substrn(url, 1, 7);
put absolute_url=;
if absolute_url eq "http://" ;
drop absolute_url;
run;
 
/* add new links */
proc append base=work.links_to_crawl data=work._append force;
run;
 
/* increment our file number */
%let html_num = %eval(&html_num + 1);
 
/* loop */
%goto next_crawl;
%end;
 
%mend crawler;
 
%crawler();
data crawler();
*length chapter $200;
* infile eSUG length=len lrecl=32767;
*input line $varying32767. len;
uline= upcase(line);
if find(uline,"example") and find(uline,"dunbar") or find(uline,"principal")or find(uline,"doctor")
or find(uline,"DR")or find(uline,"football")or find(uline,"math")or find(uline,"english")
then do;
chapter=scan(line,2,'"');
output;
end;
run;
 
3 REPLIES
Reeza
Super User

Is there a question here or are you sharing code?

kevin12
Fluorite | Level 6

The last part of the program, the word search at the webpage level, does not work. I was asking what I need to do to make it work. I apologize for not including that in the initial post.

Reeza
Super User

The last data step isn't pointed at a file or any other input, and you've commented out the INFILE and INPUT statements.
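
As a rough sketch of what pointing the step at a file could look like: the crawler macro writes each page to file1.html, file2.html, and so on, so the search step needs a FILENAME and an active INFILE/INPUT pair for one of those files. The fileref `page` and the data set name `work.page_lines` are made up for illustration, and the file is assumed to sit in the current working directory.

```sas
/* Sketch only: assumes the crawler saved its first page as file1.html
   in the current working directory */
filename page "file1.html";

data work.page_lines;
    /* len receives the length of each input record */
    infile page length=len lrecl=32767;
    input line $varying32767. len;
    line_no = _n_;
run;

/* Print the first few lines to confirm the step really read the file */
proc print data=work.page_lines(obs=5);
run;
```

Once `line` is actually populated, the word search has something to search.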

And what does 'not work' mean?

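You should also look at FINDW instead of FIND, and at the modifier argument (the third parameter of FIND) to ignore case. To a case-sensitive search, "FIND", "find" and "Find" are three different values. Note that your step upcases the line and then searches for lowercase words, so those searches can never match.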
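
A small illustration of the difference, using made-up sample lines rather than real crawled pages. FIND matches substrings, so a short term like "dr" also hits inside "children". FINDW only matches whole words between delimiters. The delimiter list (space, period, comma) is an assumption that suits this sample, and HTML would likely need more characters such as < and >.

```sas
/* Sketch: compare FIND and FINDW, both with the 'i' (ignore case) modifier */
data work.demo;
    length line $80;
    infile datalines truncover;
    input line $char80.;
    /* substring search, case-insensitive */
    hit_find  = find(line, 'dr', 'i') > 0;
    /* whole-word search; the third argument lists the delimiter characters */
    hit_findw = findw(line, 'dr', ' .,', 'i') > 0;
    datalines;
Dr. Smith teaches math
The children left early
;
run;

proc print data=work.demo;
run;
```

Both flags should be 1 for the first line, while the second line should show hit_find = 1 and hit_findw = 0. Also keep in mind that AND is evaluated before OR in your IF condition, so add parentheses if "example" and "dunbar" are meant to combine with the other words differently.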
