robm
Quartz | Level 8

I have a series of HTML files I need to parse. They are of the form http://shakespeare.mit.edu//full.html, where each directory has an HTML file called full.html. How can I iterate through and read these files?

Accepted Solution
Cynthia_sas
SAS Super FREQ

Hi:

  I get a 404 Page Not Found when I try your URL. I think it's wrong anyway, because I don't think a URL should have a double // anywhere except after the http:, and even when I try http://shakespeare.mit.edu/full.html, it doesn't work.

  However, if I go down to the page for All's Well That Ends Well at this URL (http://shakespeare.mit.edu/allswell/allswell.1.1.html), I can read the HTML file with the URL engine. What comes back is just the HTML on the page (all the tags and text), as you can see in the screen shot of my output window. It would be up to you to write a program to parse the tags if, for example, you wanted to identify the speeches or make a list of characters.
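As a rough illustration of what "parsing the tags" could start with, here is a minimal sketch that reads the same scene page and strips anything that looks like an HTML tag from each line. The fileref `scene`, the data set `scene_text`, and the variable names are made up for this example, and the sketch assumes each tag opens and closes on the same line, which may not hold for every page.

```sas
/* Sketch only: strip HTML tags line by line from one scene page.  */
/* SCENE, SCENE_TEXT, LINE and TEXT are illustrative names.        */
filename scene url 'http://shakespeare.mit.edu/allswell/allswell.1.1.html';

data scene_text;
   infile scene lrecl=32767 truncover;
   length line text $32767;
   input line $char32767.;
   /* Remove every <...> sequence; -1 means replace all matches.   */
   text = strip(prxchange('s/<[^>]*>//', -1, line));
   /* Keep only lines that still contain some text.                */
   if not missing(text);
   keep text;
run;

filename scene clear;
```

This does not decode HTML entities or tell speakers from speeches. It only shows the general shape of a line-by-line cleanup step.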

  The index.html file for the site has the list of files that you can link to, but you would have to go to each of these directories to grab the HTML in each one. This is shown below in the section of code that reads the INDEX.HTML file for the ALLSWELL subdirectory on the site.

  I suppose you could automate everything and iterate with a SAS macro program, but you first have to get everything working for one play before you can automate this for all the plays. And I'm not sure of the point of just READING the full HTML of a Shakespeare play with SAS. What is the end result you are looking for?

Cynthia

** go to main index.html page;
filename main url 'http://shakespeare.mit.edu/index.html'
         debug;
     
ods _all_ close;
title 'INDEX.HTML for Shakespeare site';
ods listing;
   data _null_;
     infile main;
     file print;
     input;
     put _infile_;
   run;
        
** read index.html to find how the plays are organized;
** every play has a separate index.html file in a separate directory;
filename alltop url
   'http://shakespeare.mit.edu/allswell/index.html'
  debug;
       
title 'INDEX.HTML for Allswell play';
ods listing;
   data _null_;
     infile alltop;
     file print;
     input;
     put _infile_;
   run;

    

** read the first section of the Alls Well that Ends Well site;
filename allswell url
   'http://shakespeare.mit.edu/allswell/allswell.1.1.html'
  debug;
       
title 'Alls Well that Ends Well section 1.1'; 
ods _all_ close;
ods listing;
   data _null_;
     infile allswell;
     file print;
     input;
     put _infile_;
   run;

** or read the FULL play;
filename allsfull url
   'http://shakespeare.mit.edu/allswell/full.html'
  debug;

title 'Alls Well that Ends Well full play ';
ods _all_ close;
ods listing;
   data _null_;
     infile allsfull;
     file print;
     input;
     put _infile_;
   run;
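
The macro-based automation mentioned above could look roughly like the following sketch. The macro name `read_play`, the data set `play_dirs`, and the output names `play_<dir>` are hypothetical. Only the `allswell` directory is confirmed in this thread, so it is the only value listed. The sketch also assumes each directory name is usable as part of a SAS data set name.

```sas
/* Sketch only: read full.html for one play directory.             */
%macro read_play(dir);
   /* &dir is a directory under the site root, e.g. allswell       */
   filename play url "http://shakespeare.mit.edu/&dir./full.html";

   data play_&dir;
      infile play lrecl=32767 truncover;
      length line $32767;
      input line $char32767.;
   run;

   filename play clear;
%mend read_play;

/* Hypothetical list of directories; extend once one play works.   */
data play_dirs;
   length dir $32;
   input dir $;
   datalines;
allswell
;
run;

/* Generate one macro call per directory in PLAY_DIRS.             */
data _null_;
   set play_dirs;
   call execute(cats('%nrstr(%read_play)(', dir, ')'));
run;
```

Wrapping the macro name in `%nrstr` delays macro execution until after the DATA step finishes, so each generated call runs as if it had been submitted on its own.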


5 REPLIES
Reeza
Super User

Bad link, but you can look into PROC HTTP for starters.
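
For reference, a minimal PROC HTTP sketch might look like this. It assumes the full-play URL that appears elsewhere in this thread, and the fileref name `resp` is made up for the example.

```sas
/* Sketch only: fetch one page with PROC HTTP into a temp file.    */
filename resp temp;

proc http
   url="http://shakespeare.mit.edu/allswell/full.html"
   method="GET"
   out=resp;
run;

/* Echo the downloaded HTML to the log to confirm it arrived.      */
data _null_;
   infile resp lrecl=32767 truncover;
   input;
   put _infile_;
run;

filename resp clear;
```

Unlike FILENAME URL, this saves the response to a file first, so the page can be re-read several times without another request.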


robm
Quartz | Level 8

hey cool thanks

robm
Quartz | Level 8

One other thing: how would I check each line for "index.html"? For example, given this line:

<a href="allswell/index.html">

I want to identify that it contains index.html, then strip out allswell/index.html so that I can use it to build:

'http://shakespeare.mit.edu/allswell/full.html'
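
One possible approach, sketched here with Perl regular expression functions, is to scan each line of the main index page for an `href` ending in `index.html`, keep the directory part, and build the full.html URL from it. The data set `play_links` and the variable names are made up for this example. The sketch assumes the links are relative and of the form `directory/index.html`, as in the line quoted above.

```sas
/* Sketch only: pull directory/index.html links out of the index.  */
filename main url 'http://shakespeare.mit.edu/index.html';

data play_links;
   infile main lrecl=32767 truncover;
   length line $32767 href dir $200 full_url $300;
   retain re;
   /* Compile the pattern once; group 1 captures the href value.   */
   if _n_ = 1 then re = prxparse('/href="([^"]*index\.html)"/i');
   input line $char32767.;
   start = 1;
   stop  = length(line);
   /* Walk through every match on the line, not just the first.    */
   call prxnext(re, start, stop, line, pos, len);
   do while (pos > 0);
      href = prxposn(re, 1, line);
      /* Only keep links that include a directory part.            */
      if index(href, '/') > 0 then do;
         dir      = scan(href, 1, '/');
         full_url = cats('http://shakespeare.mit.edu/', dir, '/full.html');
         output;
      end;
      call prxnext(re, start, stop, line, pos, len);
   end;
   keep href dir full_url;
run;

filename main clear;
```

The resulting `dir` values could then feed whatever loop reads each play, once the link pattern has been checked against the real page.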




