There is an HTML, one of whose lines is similar (Though 472 here, the real file is much longer than 32,767.) to
This is a <a href="data1.csv">data1</a>.<br /> This is a <a href="data2.csv">data2</a>.<br /> This is a <a href="data3.csv">data3</a>.<br /> This is a <a href="data4.csv">data4</a>.<br /> This is a <a href="data5.csv">data5</a>.<br /> This is a <a href="data6.csv">data6</a>.<br /> This is a <a href="data7.csv">data7</a>.<br /> This is a <a href="data8.csv">data8</a>.<br /> This is a <a href="data9.csv">data9</a>.<br /> This is a <a href="data10.csv">data10</a>.<br />
and I found that the following code does not fully work due to the length of this line.
data read;
infile htmlfile length=length;
input htm $varying32767. length;
do i=1 to 10;
htm=substr(htm,find(htm,"href",2)+6);
file=substr(htm,1,find(htm,"csv")+2);
output;
end;
drop htm i;
run;
Would there be any better approach in this case? Thanks for help.
Why are you trying to read the whole file into a single variable? Why not just read each "word" separately?
data read;
infile htmlfile length=length column=column truncover lrecl=3000000;
row +1;
do wordno=1 by 1 until(column>length);
input word ~:$32767. @;
if upcase(word)=: 'HREF=' then do;
length file $255 ;
file=scan(word,2,'"');
output;
end;
end;
run;
proc print; run;
Obs row wordno word file 1 1 5 href="data1.csv">data1</a>.<br data1.csv 2 1 11 href="data2.csv">data2</a>.<br data2.csv 3 1 17 href="data3.csv">data3</a>.<br data3.csv 4 1 23 href="data4.csv">data4</a>.<br data4.csv 5 1 29 href="data5.csv">data5</a>.<br data5.csv 6 1 35 href="data6.csv">data6</a>.<br data6.csv 7 1 41 href="data7.csv">data7</a>.<br data7.csv 8 1 47 href="data8.csv">data8</a>.<br data8.csv 9 1 53 href="data9.csv">data9</a>.<br data9.csv 10 1 59 href="data10.csv">data10</a>.<br data10.csv
Why are you trying to read the whole file into a single variable? Why not just read each "word" separately?
data read;
infile htmlfile length=length column=column truncover lrecl=3000000;
row +1;
do wordno=1 by 1 until(column>length);
input word ~:$32767. @;
if upcase(word)=: 'HREF=' then do;
length file $255 ;
file=scan(word,2,'"');
output;
end;
end;
run;
proc print; run;
Obs row wordno word file 1 1 5 href="data1.csv">data1</a>.<br data1.csv 2 1 11 href="data2.csv">data2</a>.<br data2.csv 3 1 17 href="data3.csv">data3</a>.<br data3.csv 4 1 23 href="data4.csv">data4</a>.<br data4.csv 5 1 29 href="data5.csv">data5</a>.<br data5.csv 6 1 35 href="data6.csv">data6</a>.<br data6.csv 7 1 41 href="data7.csv">data7</a>.<br data7.csv 8 1 47 href="data8.csv">data8</a>.<br data8.csv 9 1 53 href="data9.csv">data9</a>.<br data9.csv 10 1 59 href="data10.csv">data10</a>.<br data10.csv
Many thanks for your help. May I ask something?
(1) I have used LENGTH (in INPUT), TRUNCOVER, and LRECL, but not at all COLUMN. What is its function here?
(2) I found that it works without ROW+1. What is it doing? Is it necessary?
(3) Should LENGTH (inside DO-END) be located inside the loop? Can I put it outside the loop (for example, right after INFILE)?
Sorry to bother you much.
Read the documentation on the INFILE statement to understand how the LENGTH= and COLUMN= options work.
ROW+1 is an example of a sum statement. It is creating a new variable named ROW that increases by one every time the data step iterates. So the variable indicates which line (row) in the source file this word was read from.
I placed the LENGTH statement inside the loop so that the variables are defined in the order I wanted them to be defined. SAS will define the variables in the order that it sees them when compiling the data step code. If I move it higher in the data step and still want ROW and WORDNO to appear earlier then I could include them in the LENGTH statement. Note that if you set the length for WORD before the INPUT statement there is no need to include an informat in the INPUT statement.
data read;
infile htmlfile length=length column=column truncover lrecl=3000000;
length row wordno 8 word $32767 file $255 ;
row +1;
do wordno=1 by 1 until(column>length);
input word ~ @;
if upcase(word)=: 'HREF=' then do;
file=scan(word,2,'"');
output;
end;
end;
run;
Thanks for your details. I was a bit curious because
(1) ROW+1 usually creates a variable that increases by 1, but not here. I found that the ROW is all 1 as there is just one line.
(2) I found that your code is reading each word delimited by the blank. Rather than TRUNCOVER in INFILE, the code separates each block by the tilde and the colon.
I have used SAS mostly with ready-made data such as CSV, so was not familiar with this processing. Appreciate again.
@Junyong wrote:
Thanks for your details. I was a bit curious because
(1) ROW+1 usually creates a variable that increases by 1, but not here. I found that the ROW is all 1 as there is just one line.
(2) I found that your code is reading each word delimited by the blank. Rather than TRUNCOVER in INFILE, the code separates each block by the tilde and the colon.
I have used SAS mostly with ready-made data such as CSV, so was not familiar with this processing. Appreciate again.
ROW increases by one for each line read from the file. Your example file only had one line so you never see any increase.
Yes the default delimiter is a space. ~ and : have nothing to do with separating the line, they are modifiers on the INPUT statement.
The : tells the INPUT statement to use list mode input even though there is an informat specification. Without that the input statement would read exactly the number of characters the informat specified and so it could end up "eating" some of the delimiters between the words. Note that in the example where I defined the length of WORD before using it in the INPUT statement I removed both the informat and the : modifier.
I don't think you need to use the ~ modifier in this problem. The ~ on an INPUT statement just tells SAS not to remove quotes around values when using the DSD option. The DSD option is needed when you want to treat adjacent delimiters as an indication of a missing (aka null) value. In that type of file if your actual data includes the delimiter character you need to enclose the value in quotes. So when using the DSD option SAS will remove the quotes around values that start and end with a quote. I think I included it because I needed it for a similar question someone else had asked recently where they wanted to use the DSD option.
CSV files are an example of a type of file where you would normally want to use the DSD option. Especially if it is literally a Comma Separated Values file since it pretty common for commas be part of the value of a variable.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.