Hi Guys,
I am presently in the process of scraping data from a webpage using the following code:
FILENAME FOO URL 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+ben...
DATA _NULL_;
  INFILE FOO LENGTH=LEN LRECL=200000;
  RETAIN NEXT;
  INPUT RECORD $VARYING25000. LEN;
  RECORD = COMPRESS(RECORD);
  IF FIND(UPCASE(RECORD),'<LINKREL="NEXT"HREF=') THEN DO;
    START = FIND(UPCASE(RECORD),'<LINKREL="NEXT"HREF=') + LENGTH('"<LINKREL="NEXT"HREF=');
    ADDLEN = FIND(UPCASE(RECORD),"/><LINKREL='STYLESHEET'")-1 - START;
    NEXT = SUBSTR(RECORD,START,ADDLEN);
  END;
  IF UPCASE(RECORD) =: "DATA-FEATURED";
  START = FIND(UPCASE(RECORD),"/",FIND(UPCASE(RECORD),"<H2><AHREF="));
  ADDLEN = FIND(UPCASE(RECORD),"'CLASS=") - START;
  URL = "http://www.realestate.com.au"||SUBSTR(RECORD,START,ADDLEN);
  I + 1;
  DROP RECORD I;
  CALL EXECUTE("FILENAME FOO URL '" || COMPRESS(URL) || "';");
  CALL EXECUTE("DATA PAGE" || COMPRESS(I) || ";
    INFILE FOO LENGTH=LEN LRECL=60000;
    INPUT RECORD $VARYING25000. LEN;
    RUN;");
  /*IF EOF THEN CALL EXECUTE("FILENAME FOO URL '" || COMPRESS(NEXT) || "';");*/
RUN;
It works as expected, but the next step is to alter my CALL EXECUTE so that it extracts data from the relevant pages captured by the _NULL_ data step. It appears that all the data I am interested in is contained within the 12th observation of tables PAGE1 - PAGE20. Although I am interested in many of the items contained within this variable, I am most interested in Prices, Bedrooms, Car Spaces and Bathrooms.
Does anyone know if there is an alternative to a dozen SUBSTR functions to extract the data I am interested in?
In addition to this, I would also like to iterate through the search result pages, simulating the use of the webpage's "NEXT" button. Although I have been able to do this on a one-off basis, I need to do so after the last record in the _NULL_ data step has been processed. I can't use the END= option on the INFILE statement because of the subsetting IFs. Does anyone have an alternative that I could employ in this instance?
Thank you very much for your help.
Regards,
Scott
This should give you a push in the right direction.
filename foo url 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+ben...
data bar;
  length url $ 256 uri $ 128;
  if _n_=1 then do;
    dcl hash ha();
    dcl hiter hi('ha');
    ha.definekey( 'uri' );
    ha.definedata( 'url' );
    ha.definedone();
  end;
  infile foo length=len lrecl=200000 scanover dlm='''' eof=done;
  input @'<div class=''media''><a href=''/' uri : @;
  url = catx( '/' , 'http://www.realestate.com.au' , uri );
  ha.ref();
  return;
done: do;
    put 'done';
    length c $ 32767;
    call missing(c);
    drop rc prx1-prx6 fid pos len c;
    rc=hi.first();
    array prx[6];
    array r[6] $ 32 price propertytype bedrooms bathrooms bond garage;
    prx1=prxparse('#<p class=''price''><span class="hidden">(\$\d+\s[^<]+)</span></p>#');
    prx2=prxparse('#<span class=''propertyType''>([^<]+)</span>#');
    prx3=prxparse('#<li>Bedrooms:<span>(\d+)</span></li>#');
    prx4=prxparse('#<li>Bathrooms:<span>(\d+)</span></li>#');
    prx5=prxparse('#<li>Bond:<span>([^<]+)</span></li>#');
    prx6=prxparse('#<li>Garage\sSpaces:<span>(\d+)</span></li>#');
    do while(rc=0);
      rc=filename('bar',url,'url');
      if (rc ne 0) then _error_=1;
      fid=fopen('bar','i',32767,'v');
      if (fid eq 0) then _error_=1;
      do while(fread(fid)=0);
        rc=fget(fid,c,32767);
        do _n_=1 to 6;
          if prxmatch(prx[_n_],c) gt 0 then do; /* gt 0: PRXMATCH returns the match position, so a match at column 1 also counts */
            call prxposn(prx[_n_],1,pos,len);
            r[_n_]=substr(c,pos,len);
          end;
        end;
      end;
      output;
      rc=fclose(fid);
      rc=filename('bar');
      rc=hi.next();
    end;
  end;
run;
filename foo clear;
Yes, the natural escape from chains of SUBSTR and SCAN calls and conditional logic is Regular Expressions via the PRX functions and CALL routines. Or, if your target website offers web services, you could use SOAP, read the file as XML, and use a custom map/SOAP request to achieve the results.
I can't quite help you with the regex syntax unless you provide either your targeted strings/substrings or the logic of your chain of SUBSTR functions. But typically you can create a single regex that covers all of the conditions in your SUBSTR chain, then extract one piece at a time from the PRX results using a single SUBSTR statement, looping over your text with PRXNEXT.
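To make the PRXNEXT idiom concrete, here is a minimal, self-contained sketch; the pattern and the test string are illustrative only, not taken from the actual page source. CALL PRXNEXT advances the start position after each match, so one SUBSTR per hit replaces a chain of hand-positioned SUBSTR calls:

```sas
data _null_;
  /* made-up sample line; the lookarounds match only the digits between the tags */
  text = '<li>Bedrooms:<span>3</span></li><li>Bathrooms:<span>2</span></li>';
  prx = prxparse('/(?<=<span>)\d+(?=<\/span>)/');
  start = 1;
  stop  = length(text);
  call prxnext(prx, start, stop, text, pos, len);
  do while (pos > 0);
    value = substr(text, pos, len);  /* a single SUBSTR per match */
    put value=;
    call prxnext(prx, start, stop, text, pos, len);  /* find the next occurrence */
  end;
run;
```

Running this writes value=3 and value=2 to the log, one line per match.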
However, if your web pages always have exactly the same format, it might be best to preprocess the data. There should be some obvious markup tags that distinguish where your relevant data is found, and you'd likely save a lot of time by playing to the markup language's strengths. Even more so if the URL serves well-formed XHTML, in which case you can probably just strip a header line off of your URL file, read the whole thing with the XML engine, and go directly to the tag of interest with a custom map.
Long story short, many different options are available depending on how your target website is set up. Using the power of XML is the one that will cause the fewest headaches, such as having to debug the regex or the substring chain logic, but not all websites follow XML standards.
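For the XML-engine route, the general shape would be something like the sketch below. This assumes the page really is well-formed XML and that you have built an XMLMap for it; the URL, the map file, and the PROPERTIES table name are all hypothetical, not artifacts from this thread:

```sas
/* Hypothetical sketch: read well-formed XHTML with the XMLV2 LIBNAME engine.
   Assigning the fileref the same name as the libref lets the engine read
   through the URL access method. */
filename results url 'http://www.example.com/listing.xhtml';
filename map 'page.map';                  /* a custom XMLMap you would author */
libname results xmlv2 xmlmap=map;
proc print data=results.properties; run; /* table name comes from the map */
libname results clear;
filename results clear;
filename map clear;
```

The payoff is that the map, not your code, owns the knowledge of where each field lives in the markup.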
That is an outstanding solution. Thank you very much for taking the time to code it.
I understand everything in the code except for the need to define the hash object as a hash iterator. Can I ask why this is the case?
Thank you again for your help and for Vince's suggestion.
The reason is fairly simple. The hash is used to collect a unique list of URLs to scrape; the iterator is what lets me loop over that list as part of the DONE action (via the FIRST and NEXT methods on the HI object).
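To illustrate the pairing in isolation, here is a minimal sketch with made-up values: the hash deduplicates keys as they arrive, and the iterator provides the FIRST/NEXT traversal over whatever the hash stored, which a hash object alone cannot do:

```sas
data _null_;
  length url $ 20;
  dcl hash ha();                 /* collects unique keys */
  dcl hiter hi('ha');            /* walks the collected keys */
  ha.definekey('url');
  ha.definedata('url');
  ha.definedone();
  /* REF adds the key only when it is not already present, so
     the duplicate 'page1' below is stored just once */
  do url = 'page1', 'page2', 'page1', 'page3';
    ha.ref();
  end;
  /* traverse the stored keys with the iterator */
  rc = hi.first();
  do while (rc = 0);
    put url=;
    rc = hi.next();
  end;
run;
```

The log shows three URL values, one per unique key, confirming the deduplication.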
Hi FriedEgg,
Thank you for that additional information.
Another question if I could.
Before going to each of the individual search results, I would like to compile a list of every property returned by the search itself (currently this would be carried out on a page-by-page basis) by using the value associated with the NEXT button. I assume I could do this with an iterative DO loop in this instance, but I would prefer to use the NEXT button's value in case I have a future requirement to do so.
Sorry to be a pain. I rarely work with text files, so I am using this scenario to build my skills.
Thanks again for your help.
Regards,
Scott
data _null_;
  length url $ 280 c $ 32767;
  declare hash ha();
  ha.definekey('url');
  ha.definedata('url');
  ha.definedone();
  ip='www.realestate.com.au';
  path='rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+benowa%2c+qld+4217%3b+benowa+waters%2c+qld+4217';
  options='includeSurrounding=false';
  prx=prxparse('#(?<=''/)(property-\w+-[^'']+)(?='')#');
  prxnext=prxparse('#(?<=<p>)(Showing\s\d+\s-\s(\d+)\sof\s(\d+)?\sTotal\sResults)(?=</p>)#');
  do page=1 by 1 until(done);
    search_path=cats( 'http://' , ip , '/' , path , '/' , 'list-' , page , '?' , options );
    rc=filename( 'foo' , search_path , 'url' );
    if ( rc ne 0 ) then link err;
    fid=fopen( 'foo' , 'i' , 32767 , 'v' );
    if ( fid eq 0 ) then link err;
    do while ( fread( fid ) = 0 );
      rc=fget(fid,c,32767);
      if prxmatch(prxnext,c) gt 0 then do; /* gt 0: a match at column 1 also counts */
        if prxparen(prxnext) eq 3 then do;
          x=input(prxposn(prxnext,2,c),best.);
          y=input(prxposn(prxnext,3,c),best.);
          if x=y then done=1;
        end;
      end;
      if prxmatch(prx,c) gt 0 then do;
        call prxposn(prx,1,pos,len);
        url=cats( 'http://' , ip , '/' , substr(c,pos,len) );
        ha.ref();
      end;
    end;
    rc=fclose(fid);
    rc=filename('foo');
  end;
  rc=ha.output(dataset:'foo');
  stop;
err: do;
    _error_=1;
    stop;
  end;
run;
I can also do it more simply.
FILENAME FOO URL 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+benowa%2c+qld+4217%3b+benowa+waters%2c+qld+4217/list-1?includeSurrounding=false';

data x;
  infile foo lrecl=34556727 dlm='<>"=';
  input x : $100. @@;
run;

data y;
  set x;
  if lag3(x) in ('Bedrooms' 'Bathrooms')
     or (lag2(x)="'media'" and lag(x)="a href")
     or lag(x) in ('priceText' "'propertyType'");
run;

data z;
  set y;
  if index(x,"'") then do;
    n+1;
    x=cats('http://www.realestate.com.au', dequote(x));
  end;
run;

data temp;
  set z;
  by n;
  if not (first.n and last.n) and n ne 0;
run;
Xia Keshan