HTML string too long?

Posts: 29

I'm trying to capture the source code for NBA box scores, and the line that contains the player stats is very long. The SAS code below appears to capture and isolate the correct line, but I can't get at the players near the end of it. To check, I took the name of one of the seemingly missing players from the web page and used it in the output statement below; SAS output the same record, which indicates that the name is in the string. Yet when I proc print the entire string and search for 'dotson' I can't find it, and the string is only 18811 characters long even though I allocated 30000. I also did some pattern matching to isolate the stats (code not shown here, too long), and the resulting dataset does not include the players at the end of the string either, so it's not just a limitation of proc print. Does anyone know what I need to do to access the players at the end of the string?

 

Thanks, Ryan

 

 

filename junk url "http://www.espn.com/nba/boxscore?gameId=400978762" debug lrecl=30000;

data http;
    infile junk length=len;
    input record $varying30000. len;
run;

data test;
    set http; 
    seq=_N_;
    if prxmatch('/dotson/i',record) then output;
run;

 

Super Contributor
Posts: 441

Re: HTML string too long?

Hi Ryan,

 

I ran your code and there seems to be nothing unexpected. So maybe you can help us understand what exactly you are expecting and what you are getting instead.

 

Maybe you are exceeding the limit of the destination that proc print uses. If it is a text destination, then 200 is the limit. If you use PDF, then Dotson is in the proc print output.

 

What may help is that the log shows the HTML contains records that were truncated at your maximum of 30000. That can be bumped a little, but a SAS character variable tops out at 32K. This is the bit of log that tells me:

NOTE: 665 records were read from the infile JUNK.
      The minimum record length was 0.
      The maximum record length was 30000.
      One or more lines were truncated.

So you may lose player info due to the truncation. If that's the case, consider reading the file differently: instead of going line by line, use a streaming approach that gobbles up the file in chunks, ignoring record terminators. There are plenty of examples out there that use RECFM=N or similar. But always beware that HTML is not exactly easy to interpret, as it combines data and presentation. You will have to separate one from the other, which can become tedious.
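
A minimal sketch of that streaming idea, not run against the ESPN page. It assumes the INFILE option RECFM=N behaves with a URL fileref the way it does with an ordinary file, and that splitting the stream on '>' is an acceptable way to chunk the HTML. The dataset name tokens and the $2000 length are placeholders.

```sas
filename junk url "http://www.espn.com/nba/boxscore?gameId=400978762";

data tokens;
    /* RECFM=N treats the input as one stream with no record boundaries, */
    /* so no single "line" has to fit inside LRECL.                      */
    /* DLM='>' makes each token end where an HTML tag closes.            */
    infile junk recfm=n dlm='>';
    length token $2000;   /* assumed upper bound; longer tokens are cut off */
    input token @@;
    seq = _n_;
    /* Keep only the pieces that mention the player being searched for */
    if prxmatch('/dotson/i', token) then output;
run;
```

Each observation then holds one tag-sized piece of the page, so the 32K variable limit applies per piece rather than per line.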

 

Hope this helps,

- Jan.

 

Contributor
Posts: 29

Re: HTML string too long?

Posted in reply to jklaverstijn

Thanks Jan. Your suggestion to output to PDF helped me figure out the problem. I didn't include enough iterations in the do loop in the code below (see i). I used 30 before, and updating it to 45 made all of the players show up. The strange thing was that with 30 I got two blank rows after the last row containing player information, which made me think, wrongly, that I had reached the end of the string. Coincidentally, when I did a proc print (ODS HTML...) of the HTML string, the output stopped at the same player (perhaps due to a limit of the ODS HTML output?). When I output to PDF I was able to get all of the players.

 

 

data Stats_pre;
	set http; 
	keep iso_player Player_name ESPNID ESPNNAME Position Minutes fg _3pt ft oreb dreb reb ast stl blk to pf plusminus pts;
	length iso_player $750 ESPNNAME $50 Player_name Position Minutes fg _3pt ft oreb dreb reb ast stl blk to pf plusminus pts $25; 

	/* Set pattern matching */
	delete1_re=prxparse('s/[\"\@\~]//');  /* character class: strip any stray ", @ or ~ before @ and ~ are used as delimiters */
	replace_delim_re=prxparse('s/(playercard)/@/i');
	replace_delim2_re=prxparse('s/(class\=)/~/i');
	delete2_re=prxparse('s/class\=//i');
	delete3_re=prxparse('s/[\<\>]//i');
	starters_re=prxparse('/Starters/');

	ESPNID_re=prxparse('@\/id\/\d+\/@');  /* /id/3970/demarre-carroll */
	ESPNNAME_re=prxparse('%\/[a-z\-]+\>\<%');
	_null_re=prxparse("/zzzzzzz/i");
	Player_name_re=prxparse("/\>\w+\.\s*[a-z\-\s\.\']+\</i");
	Position_re=prxparse("/\>\w+\</i");
	Minutes_re=prxparse("/\>\d+\</i");
	fg_re=prxparse("/\>\d+\-\d+\</i");
	num_re=prxparse("/\>\d+\</i");
	plusminus_re=prxparse("/\>[\+\-]\d+\</i");
	
	record=compress(record,'"'); /* delete double quotes */
	call prxchange(delete1_re,-1,record);  /* delete @ */
	call prxchange(replace_delim_re,-1,record);  /* replace "playercard" with @ as delimiter */
	call prxchange(replace_delim2_re,-1,record);  /* replace "class=" with ~ as delimiter */

	if prxmatch('/boxscore_tabs/i',record) then do;
		call prxsubstr(starters_re,record,pos,len);  /* Strip all text prior to boxscore data */
		record=substr(record,pos,30000-pos);

		if substr(record,1,5) eq 'Start' then do;
			do i=1 to 45;
				iso_player=scan(record,i,'@');

				array cols [17] _null Player_name Position Minutes fg _3pt ft oreb dreb reb ast stl blk to pf plusminus pts;
				array colre[17] _null_re Player_name_re Position_re Minutes_re fg_re fg_re fg_re num_re num_re num_re num_re num_re num_re num_re num_re plusminus_re num_re ;
				do j=1 to 17;
					temp=scan(iso_player,j,'~');

					if j eq 1 then do;
						call prxsubstr(ESPNID_re,temp,pos1,len1);  
						if pos1 ne 0 then ESPNID=input(substr(temp,pos1+4,len1-5),8.); else ESPNID=.;
						call prxsubstr(ESPNNAME_re,temp,pos2,len2);  
						if pos2 ne 0 then ESPNNAME=substr(temp,pos2+1,len2-3); else ESPNNAME='';
					end;

					call prxsubstr(colre{j},temp,posJ,lenJ);  
					if posJ ne 0 then cols{j}=substr(temp,posJ,lenJ); else cols{j}='';
					call prxchange(delete3_re,-1,cols{j});
					if find(iso_player,'DNP') and j not in (2 3) then cols{j}='';

					if j eq 17 and ESPNID ne . then output;
				end;
			end;
		end;
	end;
run;
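
In hindsight, the hard-coded upper bound (30, then 45) could probably be avoided by counting the '@'-delimited pieces first. Here is a small self-contained sketch with a made-up record, since the real one is far too long to paste. The sample text and the variable n_pieces are only for illustration.

```sas
data _null_;
    length record $200 iso_player $50;
    /* Made-up stand-in for the real record after "playercard" became @ */
    record = 'Starters@player one@player two@player three';

    /* COUNTW and SCAN apply the same default delimiter rules, */
    /* so the loop visits every piece exactly once.            */
    n_pieces = countw(record, '@');

    do i = 1 to n_pieces;
        iso_player = scan(record, i, '@');
        put i= iso_player=;
    end;
run;
```

In the step above that would mean replacing "do i=1 to 45" with "do i=1 to countw(record,'@')", so a game with more players doesn't silently lose rows.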

 

Super User
Posts: 7,076

Re: HTML string too long?

That code is not going to get around the fact that a SAS variable can only be 32,767 bytes long and your source file might have lines that are longer than that. 
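
A rough sketch of how the first step could at least flag such lines, assuming LRECL is raised to the 32,767 maximum for a character variable. The warning text is just an example, and a record of exactly 32,767 bytes would be flagged too.

```sas
filename junk url "http://www.espn.com/nba/boxscore?gameId=400978762" lrecl=32767;

data http;
    infile junk length=len;
    input record $varying32767. len;
    /* LEN cannot exceed LRECL, so a value at the cap suggests */
    /* that the source line was longer and has been cut off.   */
    if len >= 32767 then
        putlog 'WARNING: record ' _n_ 'may be truncated at 32,767 bytes.';
run;
```

If the warning shows up for the line that holds the box score, reading line by line will not be enough for that game.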

 

Contributor
Posts: 29

Re: HTML string too long?

Thanks Tom. Yeah. The string I need for this example was ~26,000 characters long, so hopefully it is representative of the strings for other box scores as well and I won't have to worry about the 32,767-byte limit.
