BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

Suppose I read the following data.

/*SPAN is incorrectly read*/
data want1;
	infile cards dlm="'""" truncover;
	input @"name='" name :$16. @'value="' value :$16. @'title="' title :$32. @"<span>" span :$16. @'etc="' etc :$16.;
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Then SPAN will be read incorrectly. The problem is that I cannot include < because some other variables include it.

/*cannot include < as other variables contain it*/
data want2;
	infile cards dlm="'""<" truncover;
	input @"name='" name :$16. @'value="' value :$16. @'title="' title :$32. @"<span>" span :$16. @'etc="' etc :$16.;
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Thus, including < in DLM will influence TITLE. Instead, I saw DLMSTR, which can handle multi-character delimiters as follows. The problem, however, is that it also affects characters in DLM, so the results are totally changed as follows.

/*DLMSTR worsens this issue*/
data want3;
	infile cards dlm="'""" dlmstr="</span>" truncover;
	input @"name='" name :$16. @'value="' value :$16. @'title="' title :$32. @"<span>" span :$16. @'etc="' etc :$16.;
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

In this case, can I control the ending point of SPAN via @ so that I can correctly read the strings inside (new york, toronto, and osaka)?

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

This works with your original example:

data want1;
	infile cards dlm=" " dsd truncover;
	input @"name=" name :$16. @'value=' value :$16. @'title=' title :$32. @'etc=' etc :$16.;
   length span $ 16;
   startpos = index(_infile_,'<span>')+6;
   endpos   =  index(_infile_,'</span>');
   span = substr(_infile_,startpos,endpos-startpos);
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Note that I removed the quotes from the DLM. Values inside quotes will be read just fine. So also removed the quote ending the @name strings. If you change the informat on Value to 16. it will read as number if that is a desired result.

The DSD option prevents the spaces inside the quoted strings from use as delimiters.

 

Warning: you may want to specify a larger than 256 for your infile URL. I think the URL may default to 256, at least the FILENAME statement will default to that for when using a URL Filename statement.

 

You would likely want to drop the Startpos and Endpos variables I created for finding the value of Span. Or move the Index function calls into the substr call.

View solution in original post

4 REPLIES 4
ballardw
Super User

What file type is this coming from? Some of the stuff looks like it might come from XML which has a LIBNAME to read (many) forms.

Or perhaps a website? Proc HTTP might be the way to go.

 

 

Junyong
Pyrite | Level 9

I'm trying to scrape webpages. The following minimum example collects the HREFs and titles of current Latest Post items from a SAS Support Communities homepage.

data _null_;
	infile "https://communities.sas.com/" url truncover;
	input;
	_infile_=strip(tranwrd(_infile_,"09"x,""));
	if _infile_="<!-- TAB ONE BEGINS -->" then i+1;
	else if _infile_="<!-- TAB ONE ENDS -->" then i+-1;
	file "!userprofile\desktop\page.htm";
	if _infile_^="" & i then put _infile_;
run;

data page;
	infile "!userprofile\desktop\page.htm" dlm='"' truncover;
	input @;
	if _infile_=:'<a class="p' then do;
		input @'href="' href :$160.;
		input title :$160.;
	end;
run;

I'd like to give more detailed directions as the data aren't readily available.

ballardw
Super User

This works with your original example:

data want1;
	infile cards dlm=" " dsd truncover;
	input @"name=" name :$16. @'value=' value :$16. @'title=' title :$32. @'etc=' etc :$16.;
   length span $ 16;
   startpos = index(_infile_,'<span>')+6;
   endpos   =  index(_infile_,'</span>');
   span = substr(_infile_,startpos,endpos-startpos);
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

Note that I removed the quotes from the DLM. Values inside quotes will be read just fine. So also removed the quote ending the @name strings. If you change the informat on Value to 16. it will read as number if that is a desired result.

The DSD option prevents the spaces inside the quoted strings from use as delimiters.

 

Warning: you may want to specify a larger than 256 for your infile URL. I think the URL may default to 256, at least the FILENAME statement will default to that for when using a URL Filename statement.

 

You would likely want to drop the Startpos and Endpos variables I created for finding the value of Span. Or move the Index function calls into the substr call.

ChrisNZ
Tourmaline | Level 20

Another way:

 

data WANT;
  infile cards dlm="'""" truncover ;
  input @'name=' NAME $:20. @'value=' VALUE $:20. @'title=' TITLE $:20.  @'<span>' CITY $:20. ETC $:20. ;
  CITY=scan(CITY,1,'<');  
cards;
name='united states' value="123456" title="<b>results 2016</b>" <span>new york</span> etc="jimmy"
name='canada' value="12345" title="yearly statistics" <span>toronto</span> etc="dawn"
name='japan' value="1234567" title="annual bulletin" <span>osaka</span> etc="fujita"
;

 

 

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1628 views
  • 0 likes
  • 3 in conversation