A question about reading url txt files by SAS

Frequent Contributor
Posts: 75

Hello everyone,

I am trying to understand this piece of code, whose aim is to read from a URL and extract some data from it.

The code is:

"

data SiteVisit;
  filename foo url &SITE debug;
  retain line countA1 file_date form_type name cik ein fyr accession smbl lagline gvkey;
  infile foo lrecl=256 pad expandtabs;
  input line $char256.;
  linecount = _n_;
  if _n_ = 1 then do;
    if line eq ' ' then return;
    line1 = upcase(htmldecode(compress(line, ' ')));

"

What I cannot figure out is exactly what the input line $char256. part does. When I run the code on a sample URL, I get some HTML markup in the variable line. However, what I am trying to do is read the exact words from this HTML. Maybe this is because I know nothing about HTML, but can someone please explain the above code for me?

P.S.: An example of the address that I am trying to read is:

http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt

When I run the code, I get 66 lines (i.e., the last linecount is 66). Why is this happening?

Valued Guide
Posts: 2,175

Re: A question about reading url txt files by SAS

It is plain text at that sample URL, so start more simply:

data SiteVisit;
  Filename foo url &SITE debug;
  Length line 300;
  infile foo lrecl= 1000;
  Input;
  Line = _infile_;
  List;
  If _n_ GT 100 then stop ;
run;

That will show the text of the first 100 lines.

Super User
Posts: 10,500

Re: A question about reading url txt files by SAS

The

input line $char256. ;

means read the first 256 characters as text (the $) into the variable line.

Peter's solution is probably much better, though you may want length line $ 300; or to increase 300 to a larger value if you aren't getting all of the matching start/end elements.

Super User
Posts: 6,500

Re: A question about reading url txt files by SAS

What is the question?  The file seems perfectly readable, it does not truncate at 66 lines.

NOTE: 4820 records were read from the infile

      'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'.

      The minimum record length was 0.

      The maximum record length was 131.

You can make the code that reads the file simpler.

data _null_;
  infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
   url expandtabs truncover ;
  input ;
* list;
run;

Valued Guide
Posts: 3,208

Re: A question about reading url txt files by SAS

HTML knows nothing about records; it is about tags.

Reading a URL record by record will only work when the HTML was written in a code editor that lays it out line by line, and there is no guarantee of that. Some basic knowledge of HTML is necessary.

---->-- ja karman --<-----
Super User
Posts: 6,938

Re: A question about reading url txt files by SAS

Since you try to read with the $char256. informat, every time you have a line in the input file with fewer than 256 characters, SAS skips to the next line, until there are enough characters on one line. Obviously you have "only" 66 lines in the input file with > 256 characters.

You will find a NOTE in the log about this like "SAS went to a new line ...."

You need Tom's truncover option to avoid this (default) behaviour, which is "missover".

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 6,500

Re: A question about reading url txt files by SAS

Default is FLOWOVER which will attempt to get the data from the next line.

MISSOVER is probably the problem as it says to NOT go to the next line and to set any field where there are not enough characters for the informat requested to MISSING.  So if you try to read 256 characters from a line that only has 80 characters you end up with a variable value that is blank.

TRUNCOVER will also not move to the next line, but unlike missover it will not throw away the partial information when the input line is shorter than the input statement requested.
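
The difference between the three options can be tried on a small local file. This is only a minimal sketch: the fileref demo, the data set names flow, miss and trunc, and the three test lines are made up for illustration and are not part of the original code.

```sas
/* Write a small temporary file with lines of different lengths. */
/* The fileref and the text are hypothetical test data.          */
filename demo temp;

data _null_;
  file demo;
  put 'short';
  put 'a somewhat longer line';
  put 'mid-size';
run;

/* FLOWOVER (the default): when a line is too short for the      */
/* informat, INPUT moves on to the next line to find the value.  */
data flow;
  infile demo flowover;
  input line $char10.;
run;

/* MISSOVER: stay on the current line; a field that cannot be    */
/* filled completely is set to missing.                          */
data miss;
  infile demo missover;
  input line $char10.;
run;

/* TRUNCOVER: stay on the current line and keep whatever part    */
/* of the field is present.                                      */
data trunc;
  infile demo truncover;
  input line $char10.;
run;

title 'FLOWOVER';  proc print data=flow;  run;
title 'MISSOVER';  proc print data=miss;  run;
title 'TRUNCOVER'; proc print data=trunc; run;
title;
```

Comparing the three printed data sets and the NOTEs in the log should show which option loses lines and which one keeps the short ones.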

Super User
Posts: 6,938

Re: A question about reading url txt files by SAS

Yeah, went wrong there. Basically SAS keeps skipping lines until it has enough characters to satisfy the $256. format. This means it a) reduces the number of lines and b) discards a lot of data when, say, 10 characters are needed to fill the 256, but the next line contains 50 characters (40 characters dropped).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Valued Guide
Posts: 3,208

Re: A question about reading url txt files by SAS

Hi guys, it is nice to go into the technical details of record processing with fixed lengths, but HTML is not aware of those kinds of things. Thinking in variable lengths with tag sets is another world, isn't it?

---->-- ja karman --<-----
Super User
Posts: 6,938

Re: A question about reading url txt files by SAS

Hi Jaap, his input file does not look like HTML. At least not the HTML I'm familiar with.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 7,401

Re: A question about reading url txt files by SAS

It looks like marked-up data from an online form which has been posted via a messaging system.  Try stripping out the message headers, the checksums etc. up to <SEC-DOCUMENT>.  It's not XML, as the tags do not close.
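
One way to drop everything before <SEC-DOCUMENT> could look like the sketch below. It assumes the tag appears literally on a line of its own or within a line, as it seems to in the sample file; the data set name edgar_body and the flag variable in_doc are hypothetical names chosen for this example.

```sas
data edgar_body;
  infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
    url truncover lrecl=32767;
  length text $ 300;     /* the log in this thread reports a maximum record length of 131 */
  retain in_doc 0;       /* stays 0 until the <SEC-DOCUMENT> tag has been seen */
  input text $char300.;
  if index(text, '<SEC-DOCUMENT>') > 0 then in_doc = 1;
  if in_doc;             /* keep the tag line and everything after it */
  drop in_doc;
run;
```

The subsetting IF keeps the line holding the tag itself; if that line is not wanted, the flag could be set after the test instead.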

Valued Guide
Posts: 3,208

Re: A question about reading url txt files by SAS

Tom is right to call this a text file. I am happy to be back behind two 22-inch screens instead of a single 7-inch one.
The content of this special text file looks to be SGML (Standard Generalized Markup Language), a predecessor of current markup types.

Using the University Edition:

- 32767 has become the standard maximum length instead of 255

- reading with variable length is added to Tom's code; the automatic variable len is not stored in the data set

- the shortcut ikke is used for storing the file, as sasuser is inaccessible. In fact, an error on that will block the session (pop-up error?), requiring a restart.

For other URL content that is not of txt type, a streaming record approach would be more applicable, adding a line/tag parser.

43         libname ikke "/folders/myshortcuts/ikke";

NOTE: Libref IKKE was successfully assigned as follows:

       Engine:        V9

       Physical Name: /folders/myshortcuts/ikke

44        

45         data ikke.edgar ;

46           length text $32767 ;

47           infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'

48            url expandtabs truncover length=len ;

49           input text $varying32767. len ;

50         * list;

51         run;

NOTE: The infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt' is:

       Filename=http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt,

       Local Host Name=localhost.localdomain,

       Local Host IP addr=::1,

       Service Hostname Name=www.sec.gov,

       Service IP addr=2.19.221.59,Service Name=N/A,

       Service Portno=80,Lrecl=32767,Recfm=Variable

NOTE: 4820 records were read from the infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'.

       The minimum record length was 0.

       The maximum record length was 131.

NOTE: The data set IKKE.EDGAR has 4820 observations and 1 variables.

NOTE: DATA statement used (Total process time):

       real time           4.01 seconds

       cpu time            3.83 seconds

---->-- ja karman --<-----