robm
Quartz | Level 8

I have a file that has contents like

<td VALIGN=BASELINE>

<a href="cleopatra/index.html">Antony and Cleopatra</a>

<br><a href="coriolanus/index.html">Coriolanus</a>

<br><a href="hamlet/index.html">Hamlet</a>

and what is output by this code:

data _null_;

length line1 $30 line2 $100 ;

infile main DELIMITER='href=';

input @;

  input line1 line2;

put line1= line2= ;

run;

is

line1=<td VALIGN line2=BASELINE>

line1=<a line2="cl

line1=<b line2=><a

line1=<b line2=><a

What I would like to have output is only the lines with index.html in them:

line1="cleopatra/index.html" line2=Antony and Cleopatra

line1="coriolanus/index.html" line2=Coriolanus

line1="hamlet/index.html" line2=Hamlet

any ideas on how to do this?

1 ACCEPTED SOLUTION

Accepted Solutions
Cynthia_sas
SAS Super FREQ

Hi:

  Here's a sample program that illustrates the logic using DATALINES ("in-stream" data), as an example of the type of parsing that was suggested.

Cynthia

data _null_;
  length bigline $100 line1 $50 line2 $100 ;
  infile datalines truncover;

  ** read the whole line;
  input @1 bigline $100.;

  ** then parse the lines with HREF and INDEX.HTML only;
  if find(upcase(bigline),'HREF') gt 0 and
     find(upcase(bigline),'INDEX.HTML') gt 0 then do;
    line1 = scan(bigline,2,'"');
    line2 = scan(bigline,-3,'<>');
    put _n_= line1= line2=;
    output;
  end;
return;
datalines4;
<td VALIGN=BASELINE>
<a href="cleopatra/index.html">Antony and Cleopatra</a>
<br><a href="coriolanus/index.html">Coriolanus</a>
<br><a href="hamlet/index.html">Hamlet</a>
;;;;
run;
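For anyone wondering why the program uses -3 rather than -2 in the second SCAN call, here is a small sketch that lists the pieces SCAN sees when '<>' are the only delimiters. The variable names piece, n and i are made up for this illustration. My understanding is that because bigline is a fixed-length $100 variable and blank is not in the delimiter list, the trailing padding after the last '>' counts as one more piece, so the title ends up third from the end.

```sas
data _null_;
  length bigline piece $100;
  bigline = '<br><a href="hamlet/index.html">Hamlet</a>';

  /* count the pieces separated by < or > (blanks are NOT delimiters here) */
  n = countw(bigline, '<>');

  /* print each piece with its position so the -3 offset can be checked */
  do i = 1 to n;
    piece = scan(bigline, i, '<>');
    put i= piece=;
  end;
run;
```

If a line ever filled all 100 characters there would be no trailing padding, and the offset would shift by one, so the negative index is worth re-checking against real data.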



7 REPLIES 7
ballardw
Super User

When you use delimiter='href', SAS treats each of the individual characters as a delimiter, not the whole string.

So since "r" is a delimiter, you get <b, because the "r" in <br> was used as a delimiter. Use dlmstr='href' to get the behavior you're looking for.

You may also read the whole line into a single variable and use one of the string search functions such as FINDW or INDEXW. With your example lines you'll need to include / and . in the delimiters of the function.

if findw(upcase(string),'INDEX',' /.;:') = 0 then delete;
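To make the DLMSTR= suggestion concrete, here is a minimal sketch using in-stream data. The variable names before and after are placeholders invented for this example, and I used 'href=' (with the equals sign, as in the original post) as the delimiter string. With DLMSTR= the whole string acts as a single delimiter, so each line should split at most once.

```sas
data _null_;
  length before after $100;

  /* DLMSTR= treats 'href=' as ONE delimiter; DLM= would treat h, r, e, f and = separately */
  infile datalines dlmstr='href=' truncover;
  input before after;

  /* keep only the lines that actually mention index.html */
  if find(after, 'index.html', 'i') > 0 then put before= after=;
datalines4;
<td VALIGN=BASELINE>
<a href="cleopatra/index.html">Antony and Cleopatra</a>
<br><a href="coriolanus/index.html">Coriolanus</a>
<br><a href="hamlet/index.html">Hamlet</a>
;;;;
run;
```

Note that after should still hold the quoted path plus the rest of the tag, so it needs further parsing (for example with SCAN) to get the two values asked for.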

robm
Quartz | Level 8

cool

can I ask how I would read it into a var

I tried length line1 $30 line2 $100 ;

Ksharp
Super User

data _null_;
  infile datalines flowover  dlm='<>"';
  input @'href="' a : $100. @'>' b : $100. ;
  put a= b=;
datalines4;
<td VALIGN=BASELINE>
<a href="cleopatra/index.html">Antony and Cleopatra</a>
<br><a href="coriolanus/index.html">Coriolanus</a>
<br><a href="hamlet/index.html">Hamlet</a>
;;;;
run;

Xia Keshan


robm
Quartz | Level 8

ok cool one last thing

sometimes I have lines like this so line2 is blank until the second line is read

<a href="allswell/index.html">

All's Well That Ends Well</a>

<a href="asyoulikeit/index.html">

As You Like It</a>

so I put in logic like

if trim(line2) EQ "" then do;

      oldline1 = line1;

end;

    /*put _n_= line1= line2=;*/

if trim(line2) NE "" then do;
  if trim(line1) EQ "" then do;
   line1 = oldline1;
   put '------------line1 blank ' line1= line2= oldline1=;
  end;
  put 'http://shakespeare.mit.edu/' line1 line2 oldline1= ;

end;

so it won't be printed until line2 is populated, and line1 is just reassigned to the last line1 (the value of oldline1).

But somewhere oldline1 is getting reset. Any ideas?

ballardw
Super User

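You'll need to add: length oldline1 $xxx ; making xxx large enough to hold all the characters expected, and then retain oldline1; to keep the value across records. Without RETAIN, a variable that is assigned in the DATA step is reset to missing at the start of each iteration, which is why oldline1 appears to get reset.

You'll also likely want to reset it to blank when it is no longer needed.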
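Here is a rough sketch of how RETAIN could fit together with the whole-line parsing from earlier in the thread. It is only one way to arrange it: the helper variables hrefpos and gtpos are names I made up, the $50 length for oldline1 is an assumption, and it assumes a title never spans more than one extra line.

```sas
data _null_;
  length bigline $100 line1 oldline1 $50 line2 $100;
  retain oldline1;                 /* keeps its value from one record to the next */
  infile datalines truncover;
  input @1 bigline $100.;

  hrefpos = find(bigline, 'href=', 'i');
  if hrefpos > 0 then do;
    /* link target is the first quoted string after href= */
    line1 = scan(substr(bigline, hrefpos), 2, '"');
    /* title, if any, sits between the end of the opening tag and the next < */
    gtpos = find(bigline, '>', hrefpos);
    if gtpos > 0 and gtpos < length(bigline) then
      line2 = scan(substr(bigline, gtpos + 1), 1, '<');
    if line2 = '' then oldline1 = line1;   /* title comes on the next line */
    else do;
      put line1= line2=;
      oldline1 = '';
    end;
  end;
  else if oldline1 ne '' and bigline ne '' then do;
    /* continuation line: reuse the remembered link, then clear it */
    line1 = oldline1;
    line2 = scan(bigline, 1, '<');
    put line1= line2=;
    oldline1 = '';
  end;
datalines4;
<a href="allswell/index.html">
All's Well That Ends Well</a>
<a href="asyoulikeit/index.html">
As You Like It</a>
<br><a href="hamlet/index.html">Hamlet</a>
;;;;
run;
```

Clearing oldline1 after each use is the "reset it to blank" step, so a stale link cannot be attached to an unrelated later line.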

Ksharp
Super User

OK. Treat it as a stream file.

data x;

infile 'c:\temp\sample.txt' recfm=n  dlm='<>"';

  input x : $100. @@ ;

  if lag(x) = 'a href=' or lag2(x)='a href=';

run;

Xia Keshan
