I am using a code segment from paper 052-2009, written by Rick Langston for SAS Global Forum 2009, to create a text file from an HTML page. The code reads:
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';
filename out 'd:\mydata\myfile.txt';
data _null_;
infile in lrecl=1 recfm=f end=eof;
file out lrecl=1 recfm=f;
input @1 x $char1.;
put @1 x $char1.;
if eof;
call symputx('filesize',_n_);
run;
I'm puzzled by the following:
1) What is a data line of the input HTML file? Do the options lrecl=1 recfm=f in the INFILE statement make the INPUT statement read at most one character each time?
2) How does the INPUT statement read a Chinese character into variable x, since the informat $char1. makes it read one character each time?
3) I want to create a SAS data set using the code
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';
filename out 'd:\mydata\myfile.txt';
data dst;
infile in lrecl=1 recfm=f end=eof;
file out lrecl=1 recfm=f;
input @1 x $char1.;
put @1 x $char1.;
if eof;
call symputx('filesize',_n_);
run;
The SAS data set dst has only one observation, with a missing value. Why?
Correct. RECFM=F with LRECL=1 makes each byte a record. Each INPUT will read exactly one byte.
RECFM=F means records of a fixed length; any line-ending character or sequence (LF, CR, CRLF) is not treated as a delimiter and is instead read as ordinary data bytes.
LRECL=1 means each record is exactly one byte long. Multi-byte characters (for example, Chinese characters encoded in UTF-8) will be read piecemeal, each of their constituent bytes separately.
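If you want to see the piecemeal reading for yourself, something like the following sketch might help. It assumes the page is served as UTF-8 (I have not verified that for this URL); the variable names hex and shown and the limit of 12 bytes are arbitrary choices for illustration. It prints the hex value of the first few bytes that are outside the ASCII range, which is where the pieces of multi-byte characters show up.

```sas
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';

data _null_;
  infile in lrecl=1 recfm=f;
  input @1 x $char1.;          /* one byte per record */
  length hex $2;
  hex = put(x, $hex2.);        /* hex representation of that byte */
  /* Assumption: UTF-8 input, where bytes 80-FF belong to multi-byte characters */
  if hex >= '80' then do;
    shown + 1;                 /* sum statement: retained counter */
    put _n_= hex=;
    if shown >= 12 then stop;  /* arbitrary limit to keep the log short */
  end;
run;
```

In UTF-8, a Chinese character usually occupies three bytes, so you would expect consecutive record numbers in the log for each character.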
Your code defines a boolean variable (eof) for the end of input data which will be set to true when the last byte is read. Since you use it in a Subsetting IF, only that byte will end up in the dataset. But your code will write all web data to the text file (because the INPUT/PUT happen before the subsetting IF).
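If the goal is a data set with one observation per byte, a minimal sketch would be to replace the subsetting IF with an IF-THEN, so that the macro variable is still set on the last record but no observation is filtered out. This keeps your file names and options as they are and has not been run against this particular URL.

```sas
filename in url 'https://www.zaobao.com/realtime/china/story20211010-1201937';
filename out 'd:\mydata\myfile.txt';

data dst;
  infile in lrecl=1 recfm=f end=eof;
  file out lrecl=1 recfm=f;
  input @1 x $char1.;          /* read one byte */
  put @1 x $char1.;            /* copy it to the text file */
  /* IF-THEN only guards the CALL; every iteration still outputs an observation */
  if eof then call symputx('filesize', _n_);
run;
```

Keep in mind that x holds a single byte per observation, so multi-byte characters will be spread over several observations.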
Thanks!
Another question:
To my knowledge, without a double trailing @@, the INPUT statement reads one character from the current data line in the current DATA step iteration, and one character from the next data line in the next iteration. Is it right that the INPUT statement treats each character as a separate data line, since any line-ending character or sequence (LF, CR, CRLF) is disregarded?
Correct. RECFM=F with LRECL=1 makes each byte a record. Each INPUT will read exactly one byte.