BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

There is an HTML, one of whose lines is similar (Though 472 here, the real file is much longer than 32,767.) to

This is a <a href="data1.csv">data1</a>.<br /> This is a <a href="data2.csv">data2</a>.<br /> This is a <a href="data3.csv">data3</a>.<br /> This is a <a href="data4.csv">data4</a>.<br /> This is a <a href="data5.csv">data5</a>.<br /> This is a <a href="data6.csv">data6</a>.<br /> This is a <a href="data7.csv">data7</a>.<br /> This is a <a href="data8.csv">data8</a>.<br /> This is a <a href="data9.csv">data9</a>.<br /> This is a <a href="data10.csv">data10</a>.<br />

and I found that the following code does not fully work due to the length of this line.

data read;
	infile htmlfile length=length;
	input htm $varying32767. length;
	do i=1 to 10;
		htm=substr(htm,find(htm,"href",2)+6);
		file=substr(htm,1,find(htm,"csv")+2);
		output;
	end;
	drop htm i;
run;

Would there be any better approach in this case? Thanks for help.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Why are you trying to read the whole file into a single variable?  Why not just read each "word" separately?


data read;
  infile htmlfile length=length column=column truncover lrecl=3000000;
  row +1;
  do wordno=1 by 1 until(column>length);
    input word ~:$32767.  @;
    if upcase(word)=: 'HREF=' then do;
      length file $255 ;
      file=scan(word,2,'"');
      output;
    end;
  end;
run;
proc print; run;

 

Obs    row    wordno                  word                     file

  1     1        5      href="data1.csv">data1</a>.<br      data1.csv
  2     1       11      href="data2.csv">data2</a>.<br      data2.csv
  3     1       17      href="data3.csv">data3</a>.<br      data3.csv
  4     1       23      href="data4.csv">data4</a>.<br      data4.csv
  5     1       29      href="data5.csv">data5</a>.<br      data5.csv
  6     1       35      href="data6.csv">data6</a>.<br      data6.csv
  7     1       41      href="data7.csv">data7</a>.<br      data7.csv
  8     1       47      href="data8.csv">data8</a>.<br      data8.csv
  9     1       53      href="data9.csv">data9</a>.<br      data9.csv
 10     1       59      href="data10.csv">data10</a>.<br    data10.csv

View solution in original post

5 REPLIES 5
Tom
Super User Tom
Super User

Why are you trying to read the whole file into a single variable?  Why not just read each "word" separately?


data read;
  infile htmlfile length=length column=column truncover lrecl=3000000;
  row +1;
  do wordno=1 by 1 until(column>length);
    input word ~:$32767.  @;
    if upcase(word)=: 'HREF=' then do;
      length file $255 ;
      file=scan(word,2,'"');
      output;
    end;
  end;
run;
proc print; run;

 

Obs    row    wordno                  word                     file

  1     1        5      href="data1.csv">data1</a>.<br      data1.csv
  2     1       11      href="data2.csv">data2</a>.<br      data2.csv
  3     1       17      href="data3.csv">data3</a>.<br      data3.csv
  4     1       23      href="data4.csv">data4</a>.<br      data4.csv
  5     1       29      href="data5.csv">data5</a>.<br      data5.csv
  6     1       35      href="data6.csv">data6</a>.<br      data6.csv
  7     1       41      href="data7.csv">data7</a>.<br      data7.csv
  8     1       47      href="data8.csv">data8</a>.<br      data8.csv
  9     1       53      href="data9.csv">data9</a>.<br      data9.csv
 10     1       59      href="data10.csv">data10</a>.<br    data10.csv
Junyong
Pyrite | Level 9

Many thanks for your help. May I ask something?

(1) I have used LENGTH (in INPUT), TRUNCOVER, and LRECL, but not at all COLUMN. What is its function here?

(2) I found that it works without ROW+1. What is it doing? Is it necessary?

(3) Should LENGTH (inside DO-END) be located inside the loop? Can I put it outside the loop (for example, right after INFILE)?

Sorry to bother you much.

Tom
Super User Tom
Super User

Read the documentation on the INFILE statement to understand how the LENGTH= and COLUMN= options work.

ROW+1 is an example of a sum statement.  It is creating a new variable named ROW that increases by one every time the data step iterates. So the variable indicates which line (row) in the source file this word was read from.

I placed the LENGTH statement inside the loop so that the variables are defined in the order I wanted them to be defined.  SAS will define the variables in the order that it sees them when compiling the data step code.  If I move it higher in the data step and still want ROW and WORDNO to appear earlier then I could include them in the LENGTH statement.  Note that if you set the length for WORD before the INPUT statement there is no need to include an informat in the INPUT statement.

data read;
  infile htmlfile length=length column=column truncover lrecl=3000000;
  length row wordno 8 word $32767 file $255 ;
  row +1;
  do wordno=1 by 1 until(column>length);
    input word ~  @;
    if upcase(word)=: 'HREF=' then do;
      file=scan(word,2,'"');
      output;
    end;
  end;
run;

 

Junyong
Pyrite | Level 9

Thanks for your details. I was a bit curious because

(1) ROW+1 usually creates a variable that increases by 1, but not here. I found that the ROW is all 1 as there is just one line.

(2) I found that your code is reading each word delimited by the blank. Rather than TRUNCOVER in INFILE, the code separates each block by the tilde and the colon.

I have used SAS mostly with ready-made data such as CSV, so was not familiar with this processing. Appreciate again.

Tom
Super User Tom
Super User

@Junyong wrote:

Thanks for your details. I was a bit curious because

(1) ROW+1 usually creates a variable that increases by 1, but not here. I found that the ROW is all 1 as there is just one line.

(2) I found that your code is reading each word delimited by the blank. Rather than TRUNCOVER in INFILE, the code separates each block by the tilde and the colon.

I have used SAS mostly with ready-made data such as CSV, so was not familiar with this processing. Appreciate again.


ROW increases by one for each line read from the file. Your example file only had one line so you never see any increase.

Yes the default delimiter is a space. ~ and : have nothing to do with separating the line, they are modifiers on the INPUT statement. 

The : tells the INPUT statement to use list mode input even though there is an informat specification. Without that the input statement would read exactly the number of characters the informat specified and so it could end up "eating" some of the delimiters between the words. Note that in the example where I defined the length of WORD before using it in the INPUT statement I removed both the informat and the : modifier.

I don't think you need to use the ~ modifier in this problem. The ~ on an INPUT statement just tells SAS not to remove quotes around values when using the DSD option. The DSD option is needed when you want to treat adjacent delimiters as an indication of a missing (aka null) value. In that type of file if your actual data includes the delimiter character you need to enclose the value in quotes. So when using the DSD option SAS will remove the quotes around values that start and end with a quote. I think I included it because I needed it for a similar question someone else had asked recently where they wanted to use the DSD option.  

 

CSV files are an example of a type of file where you would normally want to use the DSD option.  Especially if it is literally a Comma Separated Values file since it pretty common for commas be part of the value of a variable.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 891 views
  • 0 likes
  • 2 in conversation