I am using a segment of code from paper 052-2009, written by Rick Langston for SAS Global Forum 2009, to create a text file from an HTML page. The code reads:
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';
filename out 'd:\mydata\myfile.txt';
data _null_;
infile in lrecl=1 recfm=f end=eof;
file out lrecl=1 recfm=f;
input @1 x $char1.;
put @1 x $char1.;
if eof;
call symputx('filesize',_n_);
run;
I'm puzzled by the following:
1) What is a data line of the input HTML file? Do the options lrecl=1 recfm=f in the INFILE statement make the INPUT statement read at most one character each time?
2) How does the INPUT statement read a Chinese character into variable x, since the informat $char1. makes it read one character at a time?
3) I want to create a SAS data set using this code:
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';
filename out 'd:\mydata\myfile.txt';
data dst;
infile in lrecl=1 recfm=f end=eof;
file out lrecl=1 recfm=f;
input @1 x $char1.;
put @1 x $char1.;
if eof;
call symputx('filesize',_n_);
run;
The SAS data set dst has only one observation, with a missing value. Why?
Correct. RECFM=F with LRECL=1 makes each byte a record. Each INPUT will read exactly one byte.
RECFM=F means records of a fixed length. Any line-ending character or sequence (LF, CR, CRLF) is not treated as a record delimiter; it is simply read as data bytes.
LRECL=1 means that each record is exactly one byte long. Multi-byte UTF-8 characters, such as Chinese characters, will be read piecemeal, each of their constituent bytes separately.
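To see the byte-by-byte behaviour for yourself, a small sketch like the one below may help. It assumes the page is served as UTF-8 (not confirmed in this thread), in which case each Chinese character arrives as three separate one-byte records. The variable name hex and the 20-byte cutoff are only illustrative.

```sas
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';

data _null_;
  /* each byte of the page is treated as one fixed-length record */
  infile in lrecl=1 recfm=f;
  input @1 x $char1.;
  /* show the byte as two hex digits; hex is a hypothetical helper variable */
  hex = put(x, $hex2.);
  put _n_= hex=;
  /* stop after the first 20 bytes (arbitrary cutoff for the demo) */
  if _n_ >= 20 then stop;
run;
```

In the log, line-ending bytes such as 0A or 0D show up as ordinary records, and any byte with a hex value of 80 or above is part of a multi-byte character rather than a complete character on its own.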
Your code defines a boolean variable (eof) for the end of input data which will be set to true when the last byte is read. Since you use it in a Subsetting IF, only that byte will end up in the dataset. But your code will write all web data to the text file (because the INPUT/PUT happen before the subsetting IF).
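If the goal is a data set with one observation per byte, one possible fix (a sketch, not tested against this particular URL) is to replace the subsetting IF with a conditional IF-THEN, so that every iteration reaches the implicit OUTPUT at the end of the step:

```sas
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';
filename out 'd:\mydata\myfile.txt';

data dst;
  infile in lrecl=1 recfm=f end=eof;
  file out lrecl=1 recfm=f;
  input @1 x $char1.;
  put @1 x $char1.;
  /* IF-THEN only guards the CALL SYMPUTX; it does not filter observations */
  if eof then call symputx('filesize', _n_);
run;

/* write the byte count to the log */
%put NOTE: filesize=&filesize;
```

With this change dst should contain as many observations as there are bytes in the page, and the macro variable filesize still holds that count.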
Thanks!
Another question:
To my knowledge, without a double trailing @@, the INPUT statement reads one character of the current data line in the current DATA step iteration, and one character of the next data line in the next iteration. Is it right that the INPUT statement treats each character as a separate data line, since any line-ending character or sequence (LF, CR, CRLF) is disregarded?
Correct. RECFM=F with LRECL=1 makes each byte a record. Each INPUT will read exactly one byte.