Hi Guys,
I am presently in the process of scraping data from a webpage using the following code:
FILENAME FOO URL 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+ben...
DATA _NULL_;
  INFILE FOO LENGTH=LEN LRECL=200000;
  RETAIN NEXT;
  INPUT RECORD $VARYING25000. LEN;
  RECORD = COMPRESS(RECORD);
  IF FIND(UPCASE(RECORD),'<LINKREL="NEXT"HREF=') THEN DO;
    START = FIND(UPCASE(RECORD),'<LINKREL="NEXT"HREF=') + LENGTH('"<LINKREL="NEXT"HREF=');
    ADDLEN = FIND(UPCASE(RECORD),"/><LINKREL='STYLESHEET'")-1 - START;
    NEXT = SUBSTR(RECORD,START,ADDLEN);
  END;
  IF UPCASE(RECORD) =: "DATA-FEATURED";
  START = FIND(UPCASE(RECORD),"/",FIND(UPCASE(RECORD),"<H2><AHREF="));
  ADDLEN = FIND(UPCASE(RECORD),"'CLASS=") - START;
  URL = "http://www.realestate.com.au"||SUBSTR(RECORD,START,ADDLEN);
  I + 1;
  DROP RECORD I;
  CALL EXECUTE("FILENAME FOO URL '" || COMPRESS(URL) || "';");
  CALL EXECUTE("DATA PAGE" || COMPRESS(I) || ";
    INFILE FOO LENGTH=LEN LRECL=60000;
    INPUT RECORD $VARYING25000. LEN;
    RUN;");
  /*IF EOF THEN CALL EXECUTE("FILENAME FOO URL '" || COMPRESS(NEXT) || "';");*/
RUN;
It works as expected, but the next step is to alter my CALL EXECUTE so that it extracts data from the relevant pages captured by the _NULL_ data step. It appears that all the data I am interested in is contained within the 12th observation of tables PAGE1 - PAGE20. Although I am interested in many of the items contained within this variable, I am most interested in Prices, Bedrooms, Car Spaces and Bathrooms.
Does anyone know if there is an alternative to a dozen SUBSTR functions to extract the data I am interested in?
In addition to this, I would also like to iterate through the search result pages, simulating the use of the webpage's "NEXT" button. Although I have been able to do this on a one-off basis, I need to do so after the last record in the _NULL_ data step has been processed. I can't use the END= option on the INFILE statement because of the subsetting IFs. Does anyone have an alternative that I could employ in this instance?
Thank you very much for your help.
Regards,
Scott
This should give you a push in the right direction.
filename foo url 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+ben...
data bar;
  length url $ 256 uri $ 128;
  if _n_=1 then do;
    dcl hash ha();
    dcl hiter hi('ha');
    ha.definekey( 'uri' );
    ha.definedata( 'url' );
    ha.definedone();
  end;
  infile foo length=len lrecl=200000 scanover dlm='''' eof=done;
  input @'<div class=''media''><a href=''/' uri : @;
  url = catx( '/' , 'http://www.realestate.com.au' , uri );
  ha.ref();
  return;
done: do;
    put 'done';
    length c $ 32767;
    call missing(c);
    drop rc prx1-prx6 fid pos len c;
    rc=hi.first();
    array prx[6];
    array r[6] $ 32 price propertytype bedrooms bathrooms bond garage;
    prx1=prxparse('#<p class=''price''><span class="hidden">(\$\d+\s[^<]+)</span></p>#');
    prx2=prxparse('#<span class=''propertyType''>([^<]+)</span>#');
    prx3=prxparse('#<li>Bedrooms:<span>(\d+)</span></li>#');
    prx4=prxparse('#<li>Bathrooms:<span>(\d+)</span></li>#');
    prx5=prxparse('#<li>Bond:<span>([^<]+)</span></li>#');
    prx6=prxparse('#<li>Garage\sSpaces:<span>(\d+)</span></li>#');
    do while(rc=0);
      rc=filename('bar',url,'url');
      if (rc ne 0) then _error_=1;
      fid=fopen('bar','i',32767,'v');
      if (fid eq 0) then _error_=1;
      do while(fread(fid)=0);
        rc=fget(fid,c,32767);
        do _n_=1 to 6;
          if prxmatch(prx[_n_],c) gt 0 then do; /* gt 0: PRXMATCH returns the match position, so a match at column 1 also counts */
            call prxposn(prx[_n_],1,pos,len);
            r[_n_]=substr(c,pos,len);
          end;
        end;
      end;
      output;
      rc=fclose(fid);
      rc=filename('bar');
      rc=hi.next();
    end;
  end;
run;
filename foo clear;
Yes, the natural escape from chains of SUBSTR and SCAN calls and conditional logic is Regular Expressions via the PRX functions and CALL routines. Or, if your target website offers web services, you could use SOAP, read the file as XML, and use a custom map/SOAP request to achieve the results.
I can't quite help you with the regex syntax unless you provide either your targeted strings/substrings or the logic of your chain of SUBSTR functions. But typically you can create a single regex that covers all of the conditions in your SUBSTR chain, then extract one piece at a time from the PRX results using a single SUBSTR statement, looping over your text with PRXNEXT.
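To make the PRXNEXT idiom concrete, here is a minimal, self-contained sketch; the pattern and the test string are illustrative only, not taken from the actual page source. CALL PRXNEXT advances the start position after each match, so one SUBSTR per hit replaces a chain of hand-positioned SUBSTR calls:

```sas
data _null_;
  /* made-up sample line; the lookarounds match only the digits between the tags */
  text = '<li>Bedrooms:<span>3</span></li><li>Bathrooms:<span>2</span></li>';
  prx = prxparse('/(?<=<span>)\d+(?=<\/span>)/');
  start = 1;
  stop  = length(text);
  call prxnext(prx, start, stop, text, pos, len);
  do while (pos > 0);
    value = substr(text, pos, len);  /* a single SUBSTR per match */
    put value=;
    call prxnext(prx, start, stop, text, pos, len);  /* find the next occurrence */
  end;
run;
```

Running this writes value=3 and value=2 to the log, one line per match.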
However, if your web pages always have exactly the same format, it might be best to preprocess the data. There should be some obvious markup tags that distinguish where your relevant data is found, and you'd likely save a lot of time by playing to the markup language's strengths. Even more so if the URL serves well-formed XHTML, in which case you can probably just strip a header line off of your URL file, read the whole thing with the XML engine, and go directly to the tag of interest with a custom map.
Long story short, many different options are available depending on how your target website is set up. Using the power of XML is the one that will cause the fewest headaches, such as having to debug the regex or the substring chain logic, but not all websites follow XML standards.
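For the XML-engine route, the general shape would be something like the sketch below. This assumes the page really is well-formed XML and that you have built an XMLMap for it; the URL, the map file, and the PROPERTIES table name are all hypothetical, not artifacts from this thread:

```sas
/* Hypothetical sketch: read well-formed XHTML with the XMLV2 LIBNAME engine.
   Assigning the fileref the same name as the libref lets the engine read
   through the URL access method. */
filename results url 'http://www.example.com/listing.xhtml';
filename map 'page.map';                  /* a custom XMLMap you would author */
libname results xmlv2 xmlmap=map;
proc print data=results.properties; run; /* table name comes from the map */
libname results clear;
filename results clear;
filename map clear;
```

The payoff is that the map, not your code, owns the knowledge of where each field lives in the markup.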
That is an outstanding solution. Thank you very much for taking the time to code it.
I understand everything in the code except for the need to define the hash object as a hash iterator. Can I ask why this is the case?
Thank you again for your help and for Vince's suggestion.
The reason is fairly simple. The hash is used to collect a unique list of URLs to scrape; the iterator is what lets me loop over that list as part of the DONE action (via the FIRST and NEXT methods on the HI object).
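To illustrate the pairing in isolation, here is a minimal sketch with made-up values: the hash deduplicates keys as they arrive, and the iterator provides the FIRST/NEXT traversal over whatever the hash stored, which a hash object alone cannot do:

```sas
data _null_;
  length url $ 20;
  dcl hash ha();                 /* collects unique keys */
  dcl hiter hi('ha');            /* walks the collected keys */
  ha.definekey('url');
  ha.definedata('url');
  ha.definedone();
  /* REF adds the key only when it is not already present, so
     the duplicate 'page1' below is stored just once */
  do url = 'page1', 'page2', 'page1', 'page3';
    ha.ref();
  end;
  /* traverse the stored keys with the iterator */
  rc = hi.first();
  do while (rc = 0);
    put url=;
    rc = hi.next();
  end;
run;
```

The log shows three URL values, one per unique key, confirming the deduplication.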
Hi FriedEgg,
Thank you for that additional information.
Another question if I could.
Before going to each of the individual search results, I would like to compile a list of every property returned by the search itself (currently this would be carried out on a page-by-page basis) by using the value associated with the NEXT button. I assume I could do this with an iterative DO loop in this instance, but I would prefer to use the NEXT button's value in case I have a future requirement to do so.
Sorry to be a pain. I rarely work with text files, so I am using this scenario to build my skills.
Thanks again for your help.
Regards,
Scott
data _null_;
  length url $ 280 c $ 32767;
  declare hash ha();
  ha.definekey('url');
  ha.definedata('url');
  ha.definedone();
  ip='www.realestate.com.au';
  path='rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+benowa%2c+qld+4217%3b+benowa+waters%2c+qld+4217';
  options='includeSurrounding=false';
  prx=prxparse('#(?<=''/)(property-\w+-[^'']+)(?='')#');
  prxnext=prxparse('#(?<=<p>)(Showing\s\d+\s-\s(\d+)\sof\s(\d+)?\sTotal\sResults)(?=</p>)#');
  do page=1 by 1 until(done);
    search_path=cats( 'http://' , ip , '/' , path , '/' , 'list-' , page , '?' , options );
    rc=filename( 'foo' , search_path , 'url' );
    if ( rc ne 0 ) then link err;
    fid=fopen( 'foo' , 'i' , 32767 , 'v' );
    if ( fid eq 0 ) then link err;
    do while ( fread( fid ) = 0 );
      rc=fget(fid,c,32767);
      if prxmatch(prxnext,c) gt 0 then do; /* gt 0: a match at column 1 also counts */
        if prxparen(prxnext) eq 3 then do;
          x=input(prxposn(prxnext,2,c),best.);
          y=input(prxposn(prxnext,3,c),best.);
          if x=y then done=1;
        end;
      end;
      if prxmatch(prx,c) gt 0 then do;
        call prxposn(prx,1,pos,len);
        url=cats( 'http://' , ip , '/' , substr(c,pos,len) );
        ha.ref();
      end;
    end;
    rc=fclose(fid);
    rc=filename('foo');
  end;
  rc=ha.output(dataset:'foo');
  stop;
err: do;
    _error_=1;
    stop;
  end;
run;
I can also do it more simply.
FILENAME FOO URL 'http://www.realestate.com.au/rent/in-southport%2c+qld+4215%3b+labrador%2c+qld+4215%3b+bundall%3b+benowa%2c+qld+4217%3b+benowa+waters%2c+qld+4217/list-1?includeSurrounding=false';

data x;
  infile foo lrecl=34556727 dlm='<>"=';
  input x : $100. @@;
run;

data y;
  set x;
  if lag3(x) in ('Bedrooms' 'Bathrooms')
     or (lag2(x)="'media'" and lag(x)="a href")
     or lag(x) in ('priceText' "'propertyType'");
run;

data z;
  set y;
  if index(x,"'") then do;
    n+1;
    x=cats('http://www.realestate.com.au', dequote(x));
  end;
run;

data temp;
  set z;
  by n;
  if not (first.n and last.n) and n ne 0;
run;
Xia Keshan