dcortell
Pyrite | Level 9

Hi experts

 

I'm trying to extract specific components from an HTML page with the following code:

 

filename src temp;
proc http
url=&url
out=src;
run;

data _null_;
infile src;
input;
list;
run;

data rep;
infile src length=len lrecl=30000000;
input line $varying32767. len;
line = strip(line);
if len>0;
run;

The problem I face is that the part of the HTML page I need lies beyond the 32,767-character limit of the $VARYING informat, so SAS probably drops it when reading the input line.

Is there a way, for large HTML bodies, to break the HTML body into lines shorter than the 32,767 limit?

 

I'm testing the code on this url:

https://www.youtube.com/c/sasusers/featured

 

And I want to isolate the line that includes this HTML text:

'metadata":{"channelMetadataRenderer":{'

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

You can read longer lines with a data step.  Just use a longer LRECL on the INFILE statement.

 

But you cannot make a character variable longer than 32,767 bytes. 

So you will either need to read the longer lines as multiple lines (line breaks don't really mean anything to HTML) or as multiple variables.

 

For example, this will create a dataset with numeric variables ROW and COL and one long character variable named LINE.  The IF statement will remove any observations where LINE is completely empty.

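data rep;
  /* LENGTH= and COLUMN= name variables that hold the record length and the column pointer */
  infile src length=len lrecl=30000000 column=cc truncover ;
  row+1;  /* one ROW value per input record */
  /* read 32,767-byte chunks until the pointer has moved past the end of the record */
  do col=1 by 1 until (cc>len);
     input line $char32767. @;  /* trailing @ holds the record for the next chunk */
     if line ne ' ' then output;
  end;
run;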
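
To get from the chunked dataset to the text asked about in the question, one option is to filter REP with the FIND function. This is only a sketch: it assumes the REP dataset produced by the step above, and the names HITS and POS are made up for illustration. Note that a match straddling the boundary between two chunks would be missed by this simple filter.

```sas
/* Assumes REP (ROW, COL, LINE) was created by the data step above. */
data hits;
  set rep;
  /* POS is the position of the search text within this chunk, 0 if absent. */
  pos = find(line, 'metadata":{"channelMetadataRenderer":{');
  if pos > 0;
run;
```

ROW and COL in HITS then tell you which input record and which 32,767-byte chunk of it contained the text.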

 


4 REPLIES
dcortell
Pyrite | Level 9

To add more info, this is the part of the log that says some lines were truncated:

 

 101        data rep;
 102        infile src _infile_=line length=len lrecl=32767;
 103        input line $varying32767. len;
 104        line = strip(line);
 105        if len>0;
 106        run;
 
 NOTE: Compression was disabled for data set WORK.REP because compression overhead would increase the size of the data set.
 NOTE: The infile SRC is:
       Filename=/sastmp/SAS_workC18B0001C28C_miseiddvp1/#LN00060,
       Owner Name=spndac,
       Group Name=europe,
       Access Permission=-rw-rw-r--,
       Last Modified=09 June 2023 18:09,
       File Size (bytes)=842654
 
 NOTE: 32 records were read from the infile SRC.
       The minimum record length was 0.
       The maximum record length was 32767.
       One or more lines were truncated.

dcortell
Pyrite | Level 9

Tom, your code does indeed break lines longer than 32,767 bytes into multiple lines, so terrific. I'm trying to understand the logic of the code:

- ROW is just a counter that is incremented by 1 for each input line

- COL is a second counter that is incremented by 1 whenever the 32,767-byte limit is reached and the HTML input line is broken into more than one line

- The @ prevents the INPUT statement from releasing the current input record and reading the next one into the buffer

So, as I understand it: is it the "@" that keeps the remaining part of the line, beyond the 32,767-byte limit, from being lost, and makes the next INPUT store it in a second SAS dataset observation?

Tom
Super User


The LENGTH= and COLUMN= options on the INFILE statement name variables that SAS sets to the length of the current line and to the column position from which the next INPUT statement will start reading.

 

The UNTIL() clause of the DO loop tells it to stop reading when it has already read past the end of the line.  
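
To watch LEN and CC change, here is a scaled-down sketch of the same loop. It is purely illustrative: the fileref DEMO1, the dataset CHUNKS, the 12-byte test record and the 5-byte chunk width are all made up. The PUT statement writes the record length and pointer position to the log after each read.

```sas
filename demo1 temp;

/* Write a single 12-byte record to a temporary file. */
data _null_;
  file demo1;
  put 'abcdefghijkl';
run;

/* Same loop as in the solution, but with 5-byte chunks so the splitting is easy to see. */
data chunks;
  infile demo1 length=len column=cc truncover;
  row+1;
  do col=1 by 1 until (cc>len);
    input line $char5. @;
    /* Log the record length and the pointer position after each read. */
    put row= col= len= cc= line=;
    if line ne ' ' then output;
  end;
run;
```

With a 12-byte record and 5-byte chunks this should give three observations for ROW=1 (two full chunks and one short one), and the loop ends once CC has moved past LEN.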

 

The TRUNCOVER option prevents the INPUT statement from jumping to the next line when there are fewer than 32,767 bytes left on the line.  Using the modern TRUNCOVER option instead of the ancient MISSOVER option allows it to use that last truncated part of the line instead of throwing it away like the MISSOVER option would have done.
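
The difference between the two options can be seen with a short record and a wider informat. This is a small sketch, not part of the original code: the fileref DEMO2 and the dataset names are invented, and it assumes the temporary file is written with variable-length records (the usual default on Unix and Windows).

```sas
filename demo2 temp;

/* Write one short record (5 bytes) to a temporary file. */
data _null_;
  file demo2;
  put 'abcde';
run;

/* TRUNCOVER: the 5 bytes that are available are assigned to X. */
data with_truncover;
  infile demo2 truncover;
  input x $char10.;
run;

/* MISSOVER: the record is shorter than the informat width, so X is set to missing. */
data with_missover;
  infile demo2 missover;
  input x $char10.;
run;
```

Comparing X in the two datasets shows why TRUNCOVER is the one to use when the last chunk of a line is shorter than the informat width.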

 

In reality your normal HTML file will have lines much shorter than 32,767 bytes.  So it might be more efficient to use a shorter LINE variable.  Perhaps something like 200 or 300 bytes.
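
A sketch of that shorter-variable variant, assuming the same SRC fileref as above (the dataset name REP_SHORT and the width of 256 are arbitrary choices for illustration):

```sas
data rep_short;
  infile src length=len lrecl=30000000 column=cc truncover;
  row+1;
  /* Same loop as before, but each chunk is at most 256 bytes. */
  do col=1 by 1 until (cc>len);
    input line $char256. @;
    if line ne ' ' then output;
  end;
run;
```

The trade-off is that very long lines get split into many more observations, so a piece of text you are searching for is more likely to be cut in two at a chunk boundary.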
