Shayan2012
Quartz | Level 8

Hello everyone,

I am trying to understand this piece of code, whose aim is to read a URL and extract some data from it.

So, the code is:

"

data SiteVisit;
filename foo url &SITE debug;
retain  line countA1 file_date form_type name cik ein fyr accession smbl lagline gvkey;
infile foo lrecl=256 pad expandtabs ;
input line $char256. ;
linecount=_n_;
If _n_ = 1 then do;
if line eq ' ' then return;
line1 = UPCASE(htmldecode(compress(line, ' ')));

"

What I cannot figure out is what exactly the input line $char256. part does. When I run the code on a sample URL, I get some HTML markup in the variable line. However, what I am trying to do is read the actual words from this HTML. Maybe this is because I know nothing about HTML, but can someone please explain the above code to me?

P.S.: An example of the address that I am trying to read is:

http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt

When I run the code, I get 66 lines (i.e., the last linecount is 66). Why is this happening?

11 REPLIES
Peter_C
Rhodochrosite | Level 12

It is plain text at that sample URL, so start more simply:

data SiteVisit;
Filename foo url &SITE debug;
Length line 300;
infile foo lrecl= 1000;
Input;
Line = _infile_;
List;
If _n_ GT 100 then stop ;
run;

That will show the text of the first 100 lines.

ballardw
Super User

The

input line $char256. ;

means read the first 256 characters as text (the $) into the variable line.

Peter's solution is probably much better, though you may want length line $ 300; or to increase 300 to a larger value if you aren't getting all of the matching start/end elements.
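To see in isolation what the $CHAR informat does, here is a minimal sketch. It assumes a throwaway temporary file; the fileref demo, the data set chars, and the variable names kept and trimmed are made up for illustration and do not come from the original code. The same record is read twice from column 1, once with $CHAR20. and once with the plain $20. informat.

```sas
/* Write one record with three leading blanks to a temporary file. */
/* "demo" is a hypothetical fileref used only for this sketch.     */
filename demo temp;

data _null_;
  file demo;
  put '   indented text';
run;

/* Read the same record twice, starting at column 1 each time. */
data chars;
  infile demo truncover;          /* short record: keep what is there   */
  input @1 kept    $char20.       /* $CHARw. preserves leading blanks   */
        @1 trimmed $20.;          /* $w. drops leading blanks           */
run;

proc print data=chars;
run;
```

The variable kept should still start with the three blanks, while trimmed should be left-aligned. That is why $char256. in the original code copies each line exactly as it appears in the file, markup and indentation included.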

Tom
Super User

What is the question? The file seems perfectly readable; it does not truncate at 66 lines.

NOTE: 4820 records were read from the infile
      'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'.
      The minimum record length was 0.
      The maximum record length was 131.

You can simplify the code that reads the file:

data _null_;
  infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
   url expandtabs truncover ;
  input ;
* list;
run;

jakarman
Barite | Level 11

HTML does not know anything about records; it is about tags.

Reading a URL record by record will succeed only when the HTML was written in a code editor that lays it out line by line. But there is no guarantee of that. Some basic knowledge of HTML is necessary.

---->-- ja karman --<-----
Kurt_Bremser
Super User

Since you try to read with the $256. format, every time you have a line in the input file with fewer than 256 characters, SAS skips to the next line, until there are enough characters on one line. Obviously you have "only" 66 lines in the input file with > 256 characters.

You will find a NOTE in the log about this like "SAS went to a new line ...."

You need Tom's truncover option to avoid this (default) behaviour, which is "missover".

Tom
Super User

Default is FLOWOVER which will attempt to get the data from the next line.

MISSOVER is probably the problem as it says to NOT go to the next line and to set any field where there are not enough characters for the informat requested to MISSING.  So if you try to read 256 characters from a line that only has 80 characters you end up with a variable value that is blank.

TRUNCOVER will also not move to the next line, but unlike missover it will not throw away the partial information when the input line is shorter than the input statement requested.
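The three options are easiest to compare side by side. The following is only a sketch under stated assumptions: the fileref short and the data set names flow, miss and trunc are hypothetical, and the three sample records are invented so that every record is shorter than the 10 columns the informat asks for.

```sas
/* Build a small test file with three short records.          */
/* "short" is a hypothetical fileref used only in this sketch. */
filename short temp;

data _null_;
  file short;
  put 'alpha';
  put 'beta';
  put 'gamma';
run;

/* FLOWOVER (the default): INPUT moves on to the next record */
/* when the current one is too short for the informat.       */
data flow;
  infile short flowover;
  input line $char10.;
run;

/* MISSOVER: stays on the record, but the short value is set to missing. */
data miss;
  infile short missover;
  input line $char10.;
run;

/* TRUNCOVER: stays on the record and keeps the characters that are there. */
data trunc;
  infile short truncover;
  input line $char10.;
run;
```

Comparing the three data sets (for example with proc print) should show fewer observations than records for flow, one blank value per record for miss, and one intact value per record for trunc. The log for the FLOWOVER step should also carry the "SAS went to a new line" note mentioned earlier in the thread.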

Kurt_Bremser
Super User

Yeah, I went wrong there. Basically, SAS keeps skipping lines until it has enough characters to satisfy the $256. format. This means it a) reduces the number of lines and b) discards a lot of data when, say, 10 characters are needed to fill the 256, but the next line contains 50 characters (40 characters dropped).

jakarman
Barite | Level 11

Hi guys, it is nice to go into the technical details of record processing with fixed lengths, but HTML is not aware of those kinds of things. Isn't thinking in variable lengths with tagsets another world?

---->-- ja karman --<-----
RW9
Diamond | Level 26

It looks like marked-up data from an online form which has been posted via a messaging system. Try stripping out the message headers, the checksums, etc., up to <SEC-DOCUMENT>. It's not XML, as the tags do not close.
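Stripping everything before <SEC-DOCUMENT> could be sketched roughly as follows. This is an illustration, not a tested solution: the data set name filing and the variables text and in_doc are hypothetical, the 256-character length is an assumption based on the maximum record length of 131 reported in the log above, and the URL is simply the one from the question, which may no longer respond in the same way.

```sas
/* Sketch: keep only the lines from <SEC-DOCUMENT> onward.       */
/* "filing", "text" and "in_doc" are hypothetical names.          */
data filing;
  length text $ 256;
  retain in_doc 0;                /* becomes 1 once the tag is seen */
  infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
         url truncover;
  input text $char256.;
  /* Flag the first line that contains the opening tag. */
  if index(upcase(text), '<SEC-DOCUMENT>') > 0 then in_doc = 1;
  if in_doc;                      /* subsetting IF drops the header lines */
  drop in_doc;
run;
```

The retained flag is what carries the "tag already seen" state from one record to the next, so every line after the tag is kept without testing it again.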

jakarman
Barite | Level 11

Tom is right to call this a text file. I am happy to be back behind two 22-inch screens instead of a single 7-inch one.
The content of this special text file looks like SGML (Standard Generalized Markup Language), a predecessor of the current markup types.

Using the University Edition:

- 32767 has become the standard instead of 255

- variable-length reading added to Tom's code. The variable len (set by length=) is not stored in the data set

- using the shortcut ikke to store the file, as sasuser is inaccessible. In fact, an error there will block the session (pop-up error?), requiring a restart.

For other URL content that is not plain text, a streaming record approach with an added line/tag parser would be more applicable.

43         libname ikke "/folders/myshortcuts/ikke";
NOTE: Libref IKKE was successfully assigned as follows:
       Engine:        V9
       Physical Name: /folders/myshortcuts/ikke
44
45         data ikke.edgar ;
46           length text $32767 ;
47           infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
48            url expandtabs truncover length=len ;
49           input text $varying32767. len ;
50         * list;
51         run;

NOTE: The infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt' is:
       Filename=http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt,
       Local Host Name=localhost.localdomain,
       Local Host IP addr=::1,
       Service Hostname Name=www.sec.gov,
       Service IP addr=2.19.221.59,Service Name=N/A,
       Service Portno=80,Lrecl=32767,Recfm=Variable

NOTE: 4820 records were read from the infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'.
       The minimum record length was 0.
       The maximum record length was 131.
NOTE: The data set IKKE.EDGAR has 4820 observations and 1 variables.
NOTE: DATA statement used (Total process time):
       real time           4.01 seconds
       cpu time            3.83 seconds

---->-- ja karman --<-----
