Hello everyone,
I am trying to understand this piece of code, whose aim is to read a URL and extract some data from it.
The code is:
"
data SiteVisit;
filename foo url &SITE debug;
retain line countA1 file_date form_type name cik ein fyr accession smbl lagline gvkey;
infile foo lrecl=256 pad expandtabs ;
input line $char256. ;
linecount=_n_;
If _n_ = 1 then do;
if line eq ' ' then return;
line1 = UPCASE(htmldecode(compress(line, ' ')));
"
What I cannot figure out is what exactly the input line $char256. part does. When I run the code on a sample URL, I get some HTML markup in the variable line. However, what I am trying to do is read the actual words from this HTML. Maybe this is because I know nothing about HTML. Can someone please explain the above code to me?
P.S. An example of the address that I am trying to read is:
http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt
When I run the code, I get 66 lines (i.e., the last linecount is 66). Why is this happening?
It is plain text at that sample URL, so start more simply:
data SiteVisit;
Filename foo url &SITE debug;
Length line 300;
infile foo lrecl= 1000;
Input;
Line = _infile_;
List;
If _n_ GT 100 then stop ;
run;
That will show the text of the first 100 lines in the log.
The
input line $char256. ;
means read the first 256 characters as text (the $) into the variable line.
Peter's solution is probably much better, though you may want length line $ 300; or increase 300 to a larger value if you aren't getting all of the matching start/end elements.
What is the question? The file seems perfectly readable; it does not truncate at 66 lines.
NOTE: 4820 records were read from the infile
'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'.
The minimum record length was 0.
The maximum record length was 131.
You can simplify the code that reads the file.
data _null_;
infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
url expandtabs truncover ;
input ;
* list;
run;
HTML does not know anything about records; it is about tags.
Reading a URL record by record will only work when the HTML was written with a code editor that lays out the structure line by line. But there is no guarantee of that. Some basic knowledge of HTML is necessary.
Since you try to read with the $char256. informat, every time a line in the input file has fewer than 256 characters, SAS skips to the next line until there are enough characters on one line. Obviously you have "only" 66 lines in the input file with > 256 characters.
You will find a NOTE in the log about this like "SAS went to a new line ...."
You need Tom's truncover option to avoid this (default) behaviour, which is "missover".
Default is FLOWOVER which will attempt to get the data from the next line.
MISSOVER is probably the problem as it says to NOT go to the next line and to set any field where there are not enough characters for the informat requested to MISSING. So if you try to read 256 characters from a line that only has 80 characters you end up with a variable value that is blank.
TRUNCOVER will also not move to the next line, but unlike missover it will not throw away the partial information when the input line is shorter than the input statement requested.
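To see the difference between the three options for yourself, here is a minimal sketch. The fileref demo, the three sample records, and the data set names flow, miss and trunc are made up for illustration; only the INFILE options come from the discussion above. Run it and compare the number of observations and the values of line in each printout.

```sas
/* Hypothetical demo file with three records of different lengths */
filename demo temp;

data _null_;
  file demo;
  put 'alpha';
  put 'be';
  put 'gamma-delta';
run;

/* FLOWOVER (the default): a record that is too short makes INPUT */
/* move on to the next record to look for the value               */
data flow;
  infile demo flowover;
  input line $char8.;
run;

/* MISSOVER: a record shorter than the informat width gives a missing value */
data miss;
  infile demo missover;
  input line $char8.;
run;

/* TRUNCOVER: a short record is read as far as it goes */
data trunc;
  infile demo truncover;
  input line $char8.;
run;

title 'FLOWOVER';
proc print data=flow; run;
title 'MISSOVER';
proc print data=miss; run;
title 'TRUNCOVER';
proc print data=trunc; run;
title;
```

The log for the FLOWOVER step should also contain the "SAS went to a new line" note mentioned earlier.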
Yeah, went wrong there. Basically SAS keeps skipping lines until it has enough characters to satisfy the $256. format. This means it a) reduces the number of lines and b) discards a lot of data when, say, 10 characters are needed to fill the 256, but the next line contains 50 characters (40 characters dropped).
Hi guys, it is nice to go into the technical details of record processing with fixed lengths, but the world of HTML is not aware of those kinds of things. Thinking in variable lengths with tagsets is another world.
Hi Jaap, his input file does not look like HTML. At least not the HTML I'm familiar with.
It looks like marked-up data from an online form which has been posted via a messaging system. Try stripping out the message headers, the checksums etc. up to <SEC-DOCUMENT>. It's not XML, as the tags do not close.
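One way to do that stripping, as a rough sketch only: keep a flag that is switched on when the <SEC-DOCUMENT> tag is first seen, and output only records from that point onward. The data set name body, the variable names text and in_doc, and the length of 200 are assumptions for illustration (200 is simply more than the maximum record length of 131 reported in the log above). The URL is the one from this thread and may not respond the same way today.

```sas
data body;
  length text $ 200;
  retain in_doc 0;      /* flag keeps its value across records */
  infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
         url expandtabs truncover;
  input text $char200.;
  /* switch the flag on at the first record containing the tag */
  if index(text, '<SEC-DOCUMENT>') > 0 then in_doc = 1;
  /* subsetting IF: drop everything before the tag */
  if in_doc;
  drop in_doc;
run;
```

Note that index() is case-sensitive, so this assumes the tag appears exactly as written above.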
Tom is right to indicate this as a text file. I am happy to be back behind two 22 inch screens not a 7 inch single one.
The content of this special text file looks to be SGML (Standard Generalized Markup Language), a predecessor of current markup types.
Using the University Edition:
- 32767 has become the standard instead of 255
- reading variable-length records has been added to Tom's code. The automatic variable len is not stored
- using the shortcut ikke for storing the file, as sasuser is inaccessible. In fact, an error on that will block the session (pop-up error?), requiring a restart.
For other URLs whose content is not plain text, the streaming record approach would be more applicable, adding a line/tag parser.
43 libname ikke "/folders/myshortcuts/ikke";
NOTE: Libref IKKE was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myshortcuts/ikke
44
45 data ikke.edgar ;
46 length text $32767 ;
47 infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'
48 url expandtabs truncover length=len ;
49 input text $varying32767. len ;
50 * list;
51 run;
NOTE: The infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt' is:
Filename=http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt,
Local Host Name=localhost.localdomain,
Local Host IP addr=::1,
Service Hostname Name=www.sec.gov,
Service IP addr=2.19.221.59,Service Name=N/A,
Service Portno=80,Lrecl=32767,Recfm=Variable
NOTE: 4820 records were read from the infile 'http://www.sec.gov:80/Archives/edgar/data/1050122/0000927356-01-000365.txt'.
The minimum record length was 0.
The maximum record length was 131.
NOTE: The data set IKKE.EDGAR has 4820 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 4.01 seconds
cpu time 3.83 seconds