BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

Suppose I read the following data.

/*SPAN is incorrectly read*/
data want1;
	infile cards dlm="'""" truncover;
	input @"name='" name :$16. @'value="' value :$16. @'title="' title :$32. @"<span>" span :$16. @'etc="' etc :$16.;
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Then SPAN will be read incorrectly. The problem is that I cannot include < because some other variables include it.

/*cannot include < as other variables contain it*/
data want2;
	infile cards dlm="'""<" truncover;
	input @"name='" name :$16. @'value="' value :$16. @'title="' title :$32. @"<span>" span :$16. @'etc="' etc :$16.;
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Thus, including < in DLM will influence TITLE. Instead, I saw DLMSTR, which can handle multi-character delimiters as follows. The problem, however, is that it also affects characters in DLM, so the results are totally changed as follows.

/*DLMSTR worsens this issue*/
data want3;
	infile cards dlm="'""" dlmstr="</span>" truncover;
	input @"name='" name :$16. @'value="' value :$16. @'title="' title :$32. @"<span>" span :$16. @'etc="' etc :$16.;
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

In this case, can I control the ending point of SPAN via @ so that I can correctly read the strings inside (new york, toronto, and osaka)?

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

This works with your original example:

data want1;
	infile cards dlm=" " dsd truncover;
	input @"name=" name :$16. @'value=' value :$16. @'title=' title :$32. @'etc=' etc :$16.;
   length span $ 16;
   startpos = index(_infile_,'<span>')+6;
   endpos   =  index(_infile_,'</span>');
   span = substr(_infile_,startpos,endpos-startpos);
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Note that I removed the quotes from the DLM. Values inside quotes will be read just fine. So also removed the quote ending the @name strings. If you change the informat on Value to 16. it will read as number if that is a desired result.

The DSD option prevents the spaces inside the quoted strings from use as delimiters.

 

Warning: you may want to specify a larger than 256 for your infile URL. I think the URL may default to 256, at least the FILENAME statement will default to that for when using a URL Filename statement.

 

You would likely want to drop the Startpos and Endpos variables I created for finding the value of Span. Or move the Index function calls into the substr call.

View solution in original post

4 REPLIES 4
ballardw
Super User

What file type is this coming from? Some of the stuff looks like it might come from XML which has a LIBNAME to read (many) forms.

Or perhaps a website? Proc HTTP might be the way to go.

 

 

Junyong
Pyrite | Level 9

I'm trying to scrape webpages. The following minimum example collects the HREFs and titles of current Latest Post items from a SAS Support Communities homepage.

data _null_;
	infile "https://communities.sas.com/" url truncover;
	input;
	_infile_=strip(tranwrd(_infile_,"09"x,""));
	if _infile_="<!-- TAB ONE BEGINS -->" then i+1;
	else if _infile_="<!-- TAB ONE ENDS -->" then i+-1;
	file "!userprofile\desktop\page.htm";
	if _infile_^="" & i then put _infile_;
run;

data page;
	infile "!userprofile\desktop\page.htm" dlm='"' truncover;
	input @;
	if _infile_=:'<a class="p' then do;
		input @'href="' href :$160.;
		input title :$160.;
	end;
run;

I'd like to give more detailed directions as the data aren't readily available.

ballardw
Super User

This works with your original example:

data want1;
	infile cards dlm=" " dsd truncover;
	input @"name=" name :$16. @'value=' value :$16. @'title=' title :$32. @'etc=' etc :$16.;
   length span $ 16;
   startpos = index(_infile_,'<span>')+6;
   endpos   =  index(_infile_,'</span>');
   span = substr(_infile_,startpos,endpos-startpos);
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Note that I removed the quotes from the DLM. Values inside quotes will be read just fine. So also removed the quote ending the @name strings. If you change the informat on Value to 16. it will read as number if that is a desired result.

The DSD option prevents the spaces inside the quoted strings from use as delimiters.

 

Warning: you may want to specify a larger than 256 for your infile URL. I think the URL may default to 256, at least the FILENAME statement will default to that for when using a URL Filename statement.

 

You would likely want to drop the Startpos and Endpos variables I created for finding the value of Span. Or move the Index function calls into the substr call.

ChrisNZ
Tourmaline | Level 20

Another way:

 

data WANT;
  infile cards dlm="'""" truncover ;
  input @'name=' NAME $:20. @'value=' VALUE $:20. @'title=' TITLE $:20.  @'<span>' CITY $:20. ETC $:20. ;
  CITY=scan(CITY,1,'<');  
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

 

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1022 views
  • 0 likes
  • 3 in conversation