Webpage scraping using SAS

New Contributor
Posts: 3

Webpage scraping using SAS

I have been working on the program, but I am having trouble using it with other websites. I am currently at a loss as to what to do to make the other websites work. Examples of the other websites are:
Here is the code.
 
filename LAW url "http://delcode.delaware.gov/sessionlaws/ga148/"; *the website address goes here;
filename chptrs TEMP; *TEMP makes "chptrs" a temporary file that is deleted when the fileref is cleared;
 
%macro chapter(url=);
      filename LAW url "&url"; *point the LAW fileref at the URL passed into the macro;
 
      data %scan(&url, -2, %str(/.))(keep=xx found); *%scan builds the dataset name from the URL: / and . are the delimiters, and -2 picks the second word counting from the right;
     
      *Do not change these numbers;
            length found $40;
            infile law length=len lrecl=32767;
            input x $varying32767. len;
            retain flag;
 
            if (_n_=1) then
            do;
                  RETAIN patternID;
                  *You can add or subtract the names you want it to search for in each page below;
                  patternID=prxparse('/(DISTRICT|SCHOOL|PUBLIC|BOARD|DRAINAGE)/i');
                  put patternID=;
            end;
           
            if x=: '<body>' then
                  flag=1; *start scanning once the <body> tag of the page is reached;
 
            if flag then
            do;
                  xx=prxchange('s/\<[^\<\>]+\>//', -1, x); *strip the HTML tags from the line, leaving only the text;
 
                  if not prxmatch('/^\s+$/', xx) then
                  do;
                        start=1;
                        stop=length(xx);
                        *put xx=;
                        /* Search for one of the target words */
                        CALL PRXNEXT(patternID , start, stop, xx, position, length);
 
                        do while (position > 0);
                              found=substr(xx, position, length); *extract the keyword that was matched from the line;
                              PUT found=start=stop=position=length=;
                              output;
                              CALL PRXNEXT(patternID , start, stop, xx, position, length);
                        end;
                  end;
 
            end;
      run;
 
%mend;
 
/* Extract all Chapter links */
data have(keep=xx) _urls(keep=uri);
      length uri $300 ;
      infile law length=len lrecl=32767;
      input x $varying32767. len;
      retain flag ;
      FILE chptrs lrecl=400;
 
      /* Holds derived FILENAME Statements */
      if x=: '<body>' then
            flag=1;
 
      if flag then
            do;
                  xx=prxchange('s/\<[^\<\>]+\>//', -1, x);
 
                  if not prxmatch('/^\s+$/', xx) then
                        output have;
 
                  if prxmatch('/CHAPTER\s+\d+/i', x) then
                        do;
                              temp=scan(x, 2, '"'); *the 2 takes the text up to the second double-quote delimiter, counting left to right (a negative count works right to left);
 
                              if not missing(temp) then
                                    do;
                                          uri=cats('http://delcode.delaware.gov/sessionlaws/ga148/', temp);
                                                *cats concatenates its arguments and strips leading and trailing blanks;
                                          /* Write out the Filename Statement */
                                          PUT '%chapter(url=%str(' uri '));';
                                          output _urls;
                                    end;
                        end;
            end;
run;
 
options source source2 mprint;
%include chptrs / lrecl=400;
 
filename _all_ clear;
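For readers more comfortable outside SAS: the tag-stripping and keyword-matching that the DATA step above performs with PRXCHANGE and PRXNEXT can be sketched with Python's re module. This is only an illustrative sketch, not part of the program; the sample line and the keyword list are just examples.

```python
import re

# Rough equivalents of the SAS PRX patterns above (keywords are examples).
TAG = re.compile(r'<[^<>]+>')                                    # prxchange tag-stripper
KEYWORDS = re.compile(r'(DISTRICT|SCHOOL|PUBLIC|BOARD|DRAINAGE)', re.IGNORECASE)

def scan_line(line):
    """Strip HTML tags from one line, then return every keyword hit on it."""
    text = TAG.sub('', line)
    if text.strip() == '':          # skip whitespace-only lines, like the prxmatch check
        return []
    return KEYWORDS.findall(text)

# Example line, analogous to one record read from the infile:
hits = scan_line('<p>The SCHOOL district board</p>')
```

Here `hits` contains every match on the line, including lower-case ones, because the pattern is compiled case-insensitively, just like the `/i` modifier in the SAS pattern.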
 

Thanks for your help,

Super User
Posts: 11,134

Re: Webpage scraping using SAS

You may need to be a bit more detailed with your question.

Does the example program run for the given site and get the expected, or at least usable, data?

Are you asking how to modify this program to access other sites? With different keywords?

Please note that attempting to read PDF files is likely to be a less-than-joyous experience. So are you attempting to download PDFs, or HTML?

 

Your first other URL shows a page that implies it is expecting some kind of query, so you likely need to change the URL, but I have no clue as to what.
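To illustrate the "expecting a query" point: such pages usually take their search terms as URL parameters, so the fix is to build the URL with the parameters filled in. A minimal Python sketch, with a made-up base URL and made-up parameter names, since we don't know what the actual site expects:

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names -- substitute whatever the
# target site actually shows in its address bar when you search manually.
base = 'http://example.gov/search'
params = {'q': 'school district', 'page': 1}

# urlencode percent-encodes the values and joins them with '&'.
url = base + '?' + urlencode(params)
```

The resulting `url` could then be handed to a `filename ... url` statement (or any HTTP client) in place of a plain page address.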

New Contributor
Posts: 3

Re: Webpage scraping using SAS

The program works for the website that is currently in it. After running the program, you can see that each chapter comes up with the words one is looking for. I want to access other websites with the same keywords. The keywords can be changed, but the problem is that whenever the website is changed, the program no longer works well. The content is mainly HTML; I already have a program that scans PDFs.

Respected Advisor
Posts: 4,135

Re: Webpage scraping using SAS


Ideally you'd have the SAS Text Analytics bundle licensed as this would give you everything you need (and more).

 

I'm sure there are ways to do everything in Foundation SAS (eventually with the help of calling some 3rd party tools out of SAS like Tika) but I'd assume it's going to cost you a lot of effort to get it right and every change to your sources will cause you a lot of additional work.

 

If you don't have access to SAS Text Analytics, or at least some of its sub-components like the Web Crawler, then consider looking into using Python for at least the data retrieval and data prep parts of your task.

 

Python is an open-source programming environment which integrates quite well with SAS (and it will integrate even better in future releases).
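As one stdlib-only illustration of the data-prep step mentioned above, Python's html.parser module gives you a real HTML parser instead of regex tag-stripping. The sample document below is a stand-in for a page that would normally be fetched first (for example with urllib.request):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data fires only for text between tags, never for the tags themselves.
        if data.strip():
            self.chunks.append(data.strip())

# In a real run the HTML would come from the target site;
# a small inline sample keeps the sketch self-contained.
sample = '<html><body><p>CHAPTER 1</p><p>An Act about a school district.</p></body></html>'
parser = TextExtractor()
parser.feed(sample)
text = ' '.join(parser.chunks)
```

The extracted `text` could then be searched for keywords exactly as the SAS program does, but without the risk of a regex mangling nested or malformed markup.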
