
Extracting Data From Webpages

Super Contributor
Posts: 297

Extracting Data From Webpages

Hi Guys,

I am presently in the process of scraping data from a webpage using the following code:

FILENAME FOO URL 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+ben...

DATA _NULL_;
  INFILE FOO LENGTH=LEN LRECL=200000;
  RETAIN NEXT;
  INPUT RECORD $VARYING25000. LEN;
  RECORD = COMPRESS(RECORD);

  IF FIND(UPCASE(RECORD),'<LINKREL="NEXT"HREF=') THEN DO;
    START  = FIND(UPCASE(RECORD),'<LINKREL="NEXT"HREF=') + LENGTH('<LINKREL="NEXT"HREF=');
    ADDLEN = FIND(UPCASE(RECORD),"/><LINKREL='STYLESHEET'") - 1 - START;
    NEXT   = SUBSTR(RECORD,START,ADDLEN);
  END;

  IF UPCASE(RECORD) =: "DATA-FEATURED";

  START  = FIND(UPCASE(RECORD),"/",FIND(UPCASE(RECORD),"<H2><AHREF="));
  ADDLEN = FIND(UPCASE(RECORD),"'CLASS=") - START;
  URL    = "http://www.realestate.com.au" || SUBSTR(RECORD,START,ADDLEN);
  I + 1;
  DROP RECORD I;

  CALL EXECUTE("FILENAME FOO URL '" || COMPRESS(URL) || "';");
  CALL EXECUTE("DATA PAGE" || COMPRESS(I) || ";
    INFILE FOO LENGTH=LEN LRECL=60000;
    INPUT RECORD $VARYING25000. LEN;
  RUN;");

  /*IF EOF THEN CALL EXECUTE("FILENAME FOO URL '" || COMPRESS(NEXT) || "';");*/
RUN;

It works as expected, but the next step is to alter my CALL EXECUTE so that it extracts data from the relevant pages captured by the _NULL_ DATA step.  It appears that all the data I am interested in is contained within the 12th observation in tables PAGE1 - PAGE20.  Although I am interested in many of the items contained within this variable, I am most interested in Prices, Bedrooms, Car Spaces and Bathrooms.

Does anyone know if there is an alternative to a dozen SUBSTR functions to extract the data I am interested in?

In addition to this, I would also like to iterate through the search result pages, simulating the use of the webpage's "Next" button.  Although I have been able to do this on a one-off basis, I need to do so after the last record in the _NULL_ DATA step has been processed.  I can't use the END= option on the INFILE statement because of the subsetting IFs.  Does anyone have an alternative that I could employ in this instance?

Thank you very much for your help.

Regards,

Scott


Accepted Solutions
Solution
‎08-20-2013 06:12 PM
Trusted Advisor
Posts: 1,300

Re: Extracting Data From Webpages

This should give you a push in the right direction.

filename foo url 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+ben...

data bar;
  length url $ 256 uri $ 128;
  if _n_=1 then do;
    dcl hash ha();
    dcl hiter hi('ha');
    ha.definekey('uri');
    ha.definedata('url');
    ha.definedone();
  end;

  infile foo length=len lrecl=200000 scanover dlm='''' eof=done;
  input @'<div class=''media''><a href=''/' uri : @;
  url = catx('/', 'http://www.realestate.com.au', uri);
  ha.ref();
  return;

  done: do;
    put 'done';
    length c $ 32767;
    call missing(c);
    drop rc prx1-prx6 fid pos len c;
    rc=hi.first();
    array prx[6];
    array r[6] $ 32 price propertytype bedrooms bathroons bond garage;
    prx1=prxparse('#<p class=''price''><span class="hidden">(\$\d+\s[^<]+)</span></p>#');
    prx2=prxparse('#<span class=''propertyType''>([^<]+)</span>#');
    prx3=prxparse('#<li>Bedrooms:<span>(\d+)</span></li>#');
    prx4=prxparse('#<li>Bathrooms:<span>(\d+)</span></li>#');
    prx5=prxparse('#<li>Bond:<span>([^<]+)</span></li>#');
    prx6=prxparse('#<li>Garage\sSpaces:<span>(\d+)</span></li>#');
    do while(rc=0);
      rc=filename('bar',url,'url');
      if (rc ne 0) then _error_=1;
      fid=fopen('bar','i',32767,'v');
      if (fid eq 0) then _error_=1;
      do while(fread(fid)=0);
        rc=fget(fid,c,32767);
        do _n_=1 to 6;
          if prxmatch(prx[_n_],c) gt 1 then do;
            call prxposn(prx[_n_],1,pos,len);
            r[_n_]=substr(c,pos,len);
          end;
        end;
      end;
      output;
      rc=fclose(fid);
      rc=filename('bar');
      rc=hi.next();
    end;
  end;
run;

filename foo clear;

url | price | propertytype | bedrooms | bathroons | bond | garage
http://www.realestate.com.au/property-apartment-qld-southport-406820916 | $380 per week | Apartment | 2 | 2 | $1,680 | 1
http://www.realestate.com.au/property-apartment-qld-southport-410922083 | $470 per week | Apartment | 1 | 2 | $1,880 | 1
http://www.realestate.com.au/property-house-qld-bundall-411154235 | $575 per week | House | 4 | 2 | $1,880 | 2
http://www.realestate.com.au/property-apartment-qld-southport-411129439 | $350 per week | Apartment | 1 | 1 | $1,440 | 1
http://www.realestate.com.au/property-apartment-qld-southport-410932091 | $510 per week | Apartment | 2 | 2 | $2,200 | 1
http://www.realestate.com.au/property-duplex+semi+detached-qld-southport-405338877 | $380 per week | Duplex/Semi-Detached | 3 | 1 | $1,600 | 1
http://www.realestate.com.au/property-apartment-qld-southport-408408187 | $360 Weekly | Apartment | 2 | 2 | $1,800 | 1
http://www.realestate.com.au/property-house-qld-benowa-411165719 | $720 p.w. | House | 4 | 2 | $1,800 | 2
http://www.realestate.com.au/property-duplex+semi+detached-qld-southport-411138847 | $370 / Wk | Duplex/Semi-Detached | 3 | 2 | $1,840 | 2
http://www.realestate.com.au/property-unit-qld-southport-404970368 | $395 p.w. | Unit | 3 | 2 | $1,480 | 2
http://www.realestate.com.au/property-house-qld-benowa-411153071 | $570 per week | House | 4 | 2 | $2,200 | 2
http://www.realestate.com.au/property-unit-qld-southport-404180032 | $345 per week | Unit | 1 | 1 | $1,400 | 2
http://www.realestate.com.au/property-apartment-qld-southport-405366318 | $450 per week | Apartment | 2 | 2 | $21,060 | 2
http://www.realestate.com.au/property-apartment-qld-southport-411087547 | $360 / Wk | Apartment | 2 | 2 | $1,480 | 1
http://www.realestate.com.au/property-apartment-qld-southport-404278010 | $520 per week | Apartment | 2 | 2 | $1,880 | 1
http://www.realestate.com.au/property-unit-qld-southport-411131067 | $550 Weekly | Unit | 3 | 2 | $1,880 | 1
http://www.realestate.com.au/property-apartment-qld-southport-406957606 | $550 per week | Apartment | 2 | 2 | $2,080 | 1
http://www.realestate.com.au/property-duplex+semi+detached-qld-labrador-410153847 | $290 per week | Duplex/Semi-Detached | 2 | 1 | $1,400 | 1
http://www.realestate.com.au/property-townhouse-qld-labrador-411055503 | $470 p.w. | Townhouse | 3 | 2 | $1,680 | 2
http://www.realestate.com.au/property-apartment-qld-southport-411030115 | $405 per week | Apartment | 2 | 2 | $1,800 | 2
http://www.realestate.com.au/property-house-qld-labrador-411167291 | $370 per week | House | 3 | 1 | $1,520 | 2
http://www.realestate.com.au/property-unit-qld-labrador-407527521 | $320 p.w. | Unit | 2 | 1 | $1,160 | 2



All Replies
Super Contributor
Posts: 339

Re: Extracting Data From Webpages

Yes, the natural alternative to chains of SUBSTR, SCAN and conditional logic is Regular Expressions via the PRX functions and CALL routines. Or, if your target website offers web services, you could use SOAP, read the file as XML, and use a custom map/SOAP request to achieve the results.

I can't quite help you with regex syntax unless you provide either your targeted strings/substrings or the logic of your chain of SUBSTR functions. But typically you can create a single regex that covers all of the conditions in your SUBSTR chain and extract one piece at a time from the PRX results with a single SUBSTR statement, looping over your text with PRXNEXT.

However, if your web pages always have exactly the same format, it might be best to preprocess the data. There should be some obvious markup tags that mark where your relevant data is found, and you would likely save a lot of time by using the markup language's strengths. Even more so if the URL serves well-formed XHTML, in which case you can probably just strip a header line off your URL file, read the whole thing as XML with the XML engine, and go directly to the tag of interest with a custom map.

Long story short, many different options are available depending on how your target website is set up. Using the power of XML is the one that will cause the fewest headaches, such as having to debug the regex or the SUBSTR chain logic, but not all websites follow XML standards.
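As a minimal sketch of that PRXNEXT loop (the pattern and the input string below are made up for illustration, not taken from your pages):

```sas
data matches;
  length piece $ 60;
  text = '<li>Bedrooms:<span>3</span></li><li>Bathrooms:<span>2</span></li>';
  /* one pattern that matches every <li>Label:<span>value</span></li> item */
  prx   = prxparse('#<li>\w+:<span>[^<]+</span></li>#');
  start = 1;
  stop  = length(text);
  /* CALL PRXNEXT advances start after each hit, so the loop walks the string */
  call prxnext(prx, start, stop, text, pos, len);
  do while (pos > 0);
    piece = substr(text, pos, len);  /* a single SUBSTR per match */
    output;
    call prxnext(prx, start, stop, text, pos, len);
  end;
  keep piece;
run;
```

Each iteration lands on the next `<li>...</li>` fragment; you would then pull the label and value apart with capture buffers rather than a dozen hard-coded SUBSTR calls.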

Super Contributor
Posts: 297

Re: Extracting Data From Webpages

That is an outstanding solution.  Thank you very much for taking the time to code it.

I understand everything in the code except for the need to define the hash object as a hash iterator.  Can I ask why this is the case?

Thank you again for your help and for Vince's suggestion.

Trusted Advisor
Posts: 1,300

Re: Extracting Data From Webpages

The reason is fairly simple.  The hash is used to collect a unique list of URLs to scrape; the iterator lets me loop over that list as part of the DONE action (the FIRST and NEXT methods on the HI object).
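In isolation, the pattern looks like this (a minimal, self-contained sketch with made-up keys; the scraping work is reduced to a PUT):

```sas
data _null_;
  length url $ 40;
  if _n_=1 then do;
    declare hash ha();        /* keyed on url, so duplicate inserts collapse */
    declare hiter hi('ha');   /* iterator bound to ha, used to walk the keys */
    ha.definekey('url');
    ha.definedata('url');
    ha.definedone();
  end;
  infile datalines end=last;
  input url $char40.;
  rc = ha.ref();              /* insert only if the key is not already present */
  if last then do;            /* after the final record, loop over the list */
    rc = hi.first();
    do while (rc = 0);
      put url=;               /* /page-1 appears once despite two inserts */
      rc = hi.next();
    end;
  end;
datalines;
/page-1
/page-2
/page-1
;
run;
```

Without the DECLARE HITER statement you could still load the hash, but you would have no way to step through its entries one at a time at end of file.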

Super Contributor
Posts: 297

Re: Extracting Data From Webpages

Hi FriedEgg,

Thank you for that additional information.

Another question if I could.

Before going to each of the individual search results, I would like to compile a list of every property returned by the search itself (currently this would be carried out on a page-by-page basis) by using the value associated with the Next button.  I assume I could do this using an iterative DO loop, but I would prefer to use the Next button's value in case I have a future requirement to do so.

Sorry to be a pain.  I rarely use text files, so I am using this scenario to build my skills.

Thanks again for your help.

Regards,

Scott

Trusted Advisor
Posts: 1,300

Re: Extracting Data From Webpages

data _null_;
  length url $ 280 c $ 32767;
  declare hash ha();
  ha.definekey('url');
  ha.definedata('url');
  ha.definedone();

  ip='www.realestate.com.au';
  path='rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+benowa%2c+qld+4217%3b+benowa+waters%2c+qld+4217';
  options='includeSurrounding=false';

  prx=prxparse('#(?<=''/)(property-\w+-[^'']+)(?='')#');
  prxnext=prxparse('#(?<=<p>)(Showing\s\d+\s-\s(\d+)\sof\s(\d+)?\sTotal\sResults)(?=</p>)#');

  do page=1 by 1 until(done);
    search_path=cats('http://', ip, '/', path, '/', 'list-', page, '?', options);
    rc=filename('foo', search_path, 'url');
    if (rc ne 0) then link err;
    fid=fopen('foo', 'i', 32767, 'v');
    if (fid eq 0) then link err;
    do while (fread(fid) = 0);
      rc=fget(fid,c,32767);
      if prxmatch(prxnext,c) gt 1 then do;
        if prxparen(prxnext) eq 3 then do;
          x=input(prxposn(prxnext,2,c),best.);
          y=input(prxposn(prxnext,3,c),best.);
          if x=y then done=1;
        end;
      end;
      if prxmatch(prx,c) gt 1 then do;
        call prxposn(prx,1,pos,len);
        url=cats('http://', ip, '/', substr(c,pos,len));
        ha.ref();
      end;
    end;
    rc=fclose(fid);
    rc=filename('foo');
  end;
  rc=ha.output(dataset:'foo');
  stop;

  err: do;
    _error_=1;
    stop;
  end;
run;

Super User
Posts: 9,676

Re: Extracting Data From Webpages

It can also be done more simply.

FILENAME FOO URL 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+benowa%2c+qld+4217%3b+benowa+waters%2c+qld+4217/list-1?includeSurrounding=false';
data x;
 infile foo lrecl=34556727 dlm='<>"=';
 input x : $100. @@;
run;
data y;
 set x;
  if lag3(x) in ('Bedrooms' 'Bathrooms' ) or (lag2(x) ="'media'" and lag(x)="a href" ) or lag(x) in ('priceText' "'propertyType'" );
run;
data z;
 set y;
 if index(x,"'") then do; n+1;x=cats( 'http://www.realestate.com.au',dequote(x) );end;
run;
data temp;
 set z;
 by n;
 if not (first.n and last.n) and n ne 0;
run;

Xia Keshan
