Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Occasional Contributor
Posts: 15

Hi all,

I am trying to read in some files currently stored in HTML format. They were sent to me by a client and reside in a folder on my network drive. Each one basically contains a link that, when clicked, redirects me to a webpage with the information that I need.

I am aware I can save the HTML pages as a PDF and proceed in that manner, but there is such a large quantity of files that I am looking for a more efficient way.

Does anyone know if it is possible to write code that "fetches" the link in the HTML file and then either saves the page as a CSV or XLS, or presents it in a manner that lets me directly extract the information I need?

Thank you!

Super Contributor
Posts: 297

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Hi Opatil,

The best person to talk to about this is ,. He has helped me out a few times and has always been outstanding.

Regards,

Scott 

Trusted Advisor
Posts: 1,300

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

I'm glad you appreciated my previous posts @Scott_Mitchell.

@opatil, as @Reeza mentions, to better help you we will need an example of what you are trying to process.  What you appear to want to do is certainly possible.  The basic approach would be:

1. Read in the HTML file you have that contains a link, then search for and collect the link address.

2. Using either of the two most common access methods, collect the data from the link:

     a. FILENAME URL

     b. PROC HTTP
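For reference, the two access methods in step 2 might look roughly like the following. This is only a sketch: the URL is a stand-in that does not come from this thread, and `page` and `resp` are fileref names made up for illustration.

```sas
/* Option a: the URL access method on a FILENAME statement.        */
/* http://www.example.com is a placeholder address.                */
filename page url "http://www.example.com";

data _null_;
    infile page lrecl=32767 truncover;
    input;
    put _infile_;               /* echo the first few lines to the log */
    if _n_ >= 5 then stop;
run;

/* Option b: PROC HTTP, which saves the response to a fileref.     */
filename resp temp;             /* temporary file to hold the response */

proc http
    url="http://www.example.com"
    method="GET"
    out=resp;
run;
```

Either way you end up with the page content in something an INFILE statement can read, so the parsing step afterwards is the same.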

Below is the link to a very similar question I helped Scott_Mitchell with:

Occasional Contributor
Posts: 15

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Thank you for the help! I am unable to attach an example, but I am wondering if this process will still work for an HTML file that is saved on my computer and not coming directly from the internet?

I have been able to program pulling a regular external website, but have trouble when I try to change the filepath to something on my personal computer.

Thank you again!

Super User
Posts: 17,784

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

It doesn't matter where the file is, web or computer, but you should be able to mock up an example that points to a public place that mimics your problem, even if it isn't exactly the same.

Trusted Advisor
Posts: 1,300

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

If you would like more specific help, you will need to post an example of the data you wish to parse.  Something like the following:

Say we have a hypothetical HTML file as follows:

C:\path\to\my\file.html

<html>
<body>
<a href="http://www.google.com"></a>
</body>
</html>

data _null_;
  /* '>' as the delimiter ends the value at the tag's closing bracket */
  infile "C:\path\to\my\file.html" dlm='>';
  /* jump to just past "href=" and read the quoted address */
  input @"href=" link : $128.;
  put link;
run;

In the LOG:

"http://www.google.com"

Now, instead of printing the extracted link, we could do something else with it, such as download its content...
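One way to hand the extracted link on to a later step is a macro variable. A rough sketch, assuming the same hypothetical file as above; the macro variable name `target_url` and the fileref `page` are invented here for illustration:

```sas
data _null_;
    infile "C:\path\to\my\file.html" dlm='>';
    input @"href=" link : $128.;
    /* DEQUOTE strips the surrounding double quotes seen in the log */
    call symputx('target_url', dequote(link));
    stop;                        /* only the first link is kept in this sketch */
run;

/* %SUPERQ keeps any & or % in the address from being treated as macro code */
filename page url "%superq(target_url)";

data _null_;
    infile page lrecl=32767 truncover;
    input;
    put _infile_;                /* dump the downloaded page to the log */
run;
```

The second DATA step only echoes the page; in practice that is where the parsing of the downloaded content would go.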

Occasional Contributor
Posts: 15

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

I attached an example in my response to Reeza.

How do I alter the code to get the state name and abbreviation, as opposed to the link? I tried the code above and it resulted in a table with the available links to all the different states.

Thank you!

Occasional Contributor
Posts: 15

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Alternatively, I tried using the following code with the intention to then extract the data that I need:

filename test "C:\Users\opatil\Desktop\New folder\State Abbreviations.htm";

proc http
  url = "file:///C:/Users/opatil/Desktop/New%20folder/State%20Abbreviations.htm#.U49Pb_ldVMU"
  out = test method = "get";
run;

However, I keep getting the following error:

ERROR: Unable to connect to Web server, errno = 10061 (The connection was refused.).

Trusted Advisor
Posts: 1,300

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

You should not be using proc http unless you are actually interfacing with the http protocol.  You already have the file locally.
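In other words, a plain FILENAME statement (no URL engine, no PROC HTTP) is enough for a file on disk. A quick sketch, using the path from the post above without the file:/// prefix, the %20 escapes, or the # fragment:

```sas
/* Ordinary Windows path: literal spaces, no URL encoding */
filename test "C:\Users\opatil\Desktop\New folder\State Abbreviations.htm";

data _null_;
    infile test lrecl=32767 truncover;
    input;
    put _infile_;                /* dump the raw HTML to the log as a check */
run;
```

If the log shows the HTML source, the file is being read correctly and the INPUT statement can be replaced with whatever parsing is needed.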

Super User
Posts: 17,784

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Yes, can you post what your html file might look like or an example?

Occasional Contributor
Posts: 15

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Thank you, Reeza.

I have attached a simplified example. What would be some basic code for fetching/opening the HTML file and then extracting the State Name and Abbreviation? Is it possible to directly create a SAS dataset with this information?

Thanks!

Attachment
Trusted Advisor
Posts: 1,300

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

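filename st url "http://www.50states.com/abbreviations.htm";
*filename st "C:\path\to\file.html";

data _null_;
  /* treat both angle brackets as delimiters so values stop at tag boundaries */
  infile st dlm='<>';
  /* after each '<td><a href=' read the link and state name, then the abbreviation after the next '<td>' */
  input @'<td><a href=' link : $128. state : $32. @'<td>' abbr : $2.;
  put abbr state link;
run;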
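To land the values in a SAS dataset rather than the log, one option is to name an output dataset on the DATA statement instead of `_null_`. A minimal sketch, assuming the page uses the same `<td><a href=...>` markup as above; the dataset name `states` and the placeholder path are assumptions:

```sas
filename st "C:\path\to\file.html";    /* placeholder local path */

data states;                            /* hypothetical dataset name */
    infile st dlm='<>';
    input @'<td><a href=' link : $128. state : $32. @'<td>' abbr : $2.;
    keep state abbr;                    /* drop the link, keep name and abbreviation */
run;

proc print data=states;
run;
```

Each successful INPUT writes one row, so the dataset should end up with one observation per table row that matches the pattern.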

Occasional Contributor
Posts: 15

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Thank you.

I tried using my actual file, and it says "specified address not available" - I think it is because the file is from a data management system and not publicly available on the web, so there is no general URL I can use in the file name statement.

Have you encountered this before?

Super User
Posts: 17,784

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Is your SAS on a server that can't access the web or your network drives?

Trusted Advisor
Posts: 1,300

Re: Converting HTML Page In Computer Folder To Excel, CSV, or Reading In Directly?

Please look at the commented filename in my example.  You should not be specifying a file URL protocol or using the URL filename engine.  Just access the file by its local path.

filename st "C:/Users/opatil/Desktop/New folder/State Abbreviations.htm";

data _null_;
  infile st dlm='<>';
  input @'<td><a href=' link : $128. state : $32. @'<td>' abbr : $2.;
  put abbr state link;
run;
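Since the original question mentions a large number of files, the same idea can probably be stretched over a whole folder with a wildcard on the INFILE statement. This is a sketch only: the folder path, the dataset name `links`, and the variable names are made up, and it captures just the first `href="..."` on each line.

```sas
data links;                                  /* hypothetical dataset name */
    length fname source_file $260 link $256;
    /* the wildcard reads every .htm file in the (placeholder) folder;  */
    /* FILENAME= exposes the name of the file currently being read      */
    infile "C:\path\to\folder\*.htm" filename=fname lrecl=32767 truncover;
    input line $char32767.;
    source_file = fname;                     /* FILENAME= variables are not saved, so copy */
    pos = find(line, 'href="', 'i');         /* case-insensitive search */
    if pos > 0 then do;
        /* the address is the text up to the next double quote */
        link = scan(substr(line, pos + 6), 1, '"');
        output;
    end;
    keep source_file link;
run;
```

Reading line by line here, rather than searching across records, keeps a match from one file from spilling into the next.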
