Hi experts
I'm trying to extract specific components from an HTML page with the following code:
filename src temp;
proc http
url=&url
out=src;
run;
data _null_;
infile src;
input;
list;
run;
data rep;
infile src length=len lrecl=30000000;
input line $varying32767. len;
line = strip(line);
if len>0;
run;
The problem I face is that the part of the HTML page I need lies beyond the 32,767-character limit of the $VARYING informat, so SAS probably drops it when reading the input line.
Is there a way, for large HTML bodies, to break the body into lines shorter than the 32,767-byte limit?
I'm testing the code on this url:
- https://www.youtube.com/c/sasusers/featured
And I want to isolate the line that includes this piece of HTML text:
'metadata":{"channelMetadataRenderer":{'
You can read longer lines with a data step. Just use a longer LRECL on the INFILE statement.
But you cannot make a character variable longer than 32,767 bytes.
So you will either need to read the longer lines as multiple lines (line breaks don't really mean anything to HTML) or as multiple variables.
For example, this will create a dataset with two numeric variables, ROW and COL, and one long character variable named LINE. The IF statement removes any observations where LINE is completely empty.
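data rep;
  /* LENGTH= and COLUMN= name variables that hold the record length and the current column */
  infile src length=len lrecl=30000000 column=cc truncover;
  row+1;                          /* one ROW per input record */
  do col=1 by 1 until (cc>len);   /* one COL per chunk of up to 32,767 bytes */
    input line $char32767. @;     /* trailing @ holds the record for the next chunk */
    if line ne ' ' then output;
  end;
run;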
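Once the page is stored in chunks, the text the question asks about can be searched for in each observation. The step below is only a sketch: the dataset name HITS and the variable POS are made up for illustration, and it assumes the REP dataset created by the step above. Note that it will miss the marker if the text happens to straddle the boundary between two chunks.

```sas
/* Sketch: keep only the chunks of WORK.REP that contain the marker text. */
/* HITS and POS are illustrative names, not part of the original code.    */
data hits;
  set rep;
  /* FIND returns the position of the substring in LINE, or 0 if absent. */
  pos = find(line, 'metadata":{"channelMetadataRenderer":{');
  if pos > 0;
run;
```

ROW and COL in the result show which input record and which chunk of that record contained the match.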
To add more info, this is the part of the log where it says that some lines were truncated:
101 data rep;
102 infile src _infile_=line length=len lrecl=32767;
103 input line $varying32767. len;
104 line = strip(line);
105 if len>0;
106 run;
NOTE: Compression of data set WORK.REP is disabled because it would increase the size of the data set.
NOTE: The infile SRC is:
Filename=/sastmp/SAS_workC18B0001C28C_miseiddvp1/#LN00060,
Owner Name=spndac,
Group Name=europe,
Access Permission=-rw-rw-r--,
Last Modified=09 June 2023 18:09,
File Size (bytes)=842654
NOTE: 32 records were read from the infile SRC.
The minimum record length was 0.
The maximum record length was 32767.
One or more lines were truncated.
Tom, your code does indeed break lines longer than 32,767 bytes into multiple lines, so terrific. I'm trying to understand the logic of the code:
- row is just a counter that increases by 1 for each input line
- col is a second counter that increases by 1 whenever the 32,767-byte limit is reached and the HTML input line is broken into more than one line
- The @ prevents the INPUT statement from releasing the current input record and reading the next one into the buffer
So, as I understand it: is it the "@" that keeps the remainder of the line beyond the 32,767-byte limit from being lost, and makes the pointer store it in a second observation of the SAS dataset?
The LENGTH= and COLUMN= options on the INFILE statement create variables that SAS will set to the length of the current line and current position that the next INPUT statement will start reading from.
The UNTIL() clause of the DO loop tells it to stop reading when it has already read past the end of the line.
The TRUNCOVER option prevents the INPUT statement from jumping to the next line when there are fewer than 32,767 bytes left on the line. Using the modern TRUNCOVER option instead of the ancient MISSOVER option lets it use that last, shorter part of the line instead of throwing it away, as MISSOVER would have done.
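The difference between the two options can be seen on a tiny file. This is a minimal sketch, assuming a temporary fileref called DEMO and dataset names chosen only for illustration; the record is 3 bytes long while the informat asks for 10.

```sas
/* Write one short record (3 bytes) to a temporary file. */
filename demo temp;
data _null_;
  file demo;
  put 'abc';
run;

/* TRUNCOVER: the 3 bytes that are available are kept in TEXT. */
data with_truncover;
  infile demo truncover;
  input text $char10.;
run;

/* MISSOVER: the record is shorter than the informat width, */
/* so TEXT is set to missing.                               */
data with_missover;
  infile demo missover;
  input text $char10.;
run;
```

Printing the two datasets side by side shows why the chunked read relies on TRUNCOVER for the last, partial chunk of each line.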
In reality your normal HTML file will have lines much shorter than 32,767 bytes. So it might be more efficient to use a shorter LINE variable. Perhaps something like 200 or 300 bytes.
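To watch the LENGTH= and COLUMN= variables move, the same loop can be run on a toy file with a much smaller chunk. The sketch below assumes a temporary fileref TOY, a dataset name CHUNKS, and a 10-byte chunk width, all chosen purely for illustration; the PUT statement writes the pointer values to the log on each pass.

```sas
/* Toy input: one 25-byte record in a temporary file. */
filename toy temp;
data _null_;
  file toy;
  put 'abcdefghijklmnopqrstuvwxy';
run;

/* Same loop as the REP step, but with a 10-byte chunk instead of 32,767. */
data chunks;
  infile toy length=len column=cc truncover;
  row+1;
  do col=1 by 1 until (cc>len);
    input line $char10. @;
    put row= col= len= cc= line=;   /* trace the pointer in the log */
    if line ne ' ' then output;
  end;
run;
```

Changing the width in the informat (and the length of LINE with it) is all that is needed to pick a different chunk size, such as the 200 or 300 bytes suggested above.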