kevin12
Fluorite | Level 6
I have been working on this program, but I am having trouble adapting it to other websites and am not sure how to proceed. Examples of the other websites are
Here is the code.
 
filename LAW url "http://delcode.delaware.gov/sessionlaws/ga148/"; *the website name goes here;
filename chptrs TEMP; *TEMP makes "chptrs" a temporary file that is deleted when the session ends;
 
%macro chapter(url=);
      filename LAW url "&url"; *point the LAW fileref at the URL passed to the macro;
 
      data %scan(&url, -2, %str(/.))(keep=xx found); /* %SCAN builds the dataset name from the URL:
      / and . are the delimiters, and -2 takes the second token counting from the right */
     
      *Do not change these numbers;
            length found $40;
            infile law length=len lrecl=32767;
            input x $varying32767. len;
            retain flag;
 
            if (_n_=1) then
            do;
                  RETAIN patternID;
                  *You can add or subtract the names you want it to search for in each page below;
                  patternID=prxparse('/(DISTRICT|SCHOOL|PUBLIC|BOARD|DRAINAGE)/i');
                  put patternID=;
            end;
           
            if x=: '<body>' then
                  flag=1; *start capturing once the <body> tag is reached;
 
            if flag then
            do;
                  xx=prxchange('s/\<[^\<\>]+\>//', -1, x); *strip the HTML tags from the line;
 
                  if not prxmatch('/^\s+$/', xx) then
                  do;
                        start=1;
                        stop=length(xx);
                        *put xx=;
                        /* Search for one of the target words */
                        CALL PRXNEXT(patternID , start, stop, xx, position, length);
 
                        do while (position > 0);
                              found=substr(xx, position, length); *extract the keyword that was matched;
                              PUT found=start=stop=position=length=;
                              output;
                              CALL PRXNEXT(patternID , start, stop, xx, position, length);
                        end;
                  end;
 
            end;
      run;
 
%mend;
 
/* Extract all Chapter links */
data have(keep=xx) _urls(keep=uri);
      length uri $300 ;
      infile law length=len lrecl=32767;
      input x $varying32767. len;
      retain flag ;
      FILE chptrs lrecl=400;
 
      /* Holds derived FILENAME Statements */
      if x=: '<body>' then
            flag=1;
 
      if flag then
            do;
                  xx=prxchange('s/\<[^\<\>]+\>//', -1, x);
 
                  if not prxmatch('/^\s+$/', xx) then
                        output have;
 
                  if prxmatch('/CHAPTER\s+\d+/i', x) then
                        do;
                              temp=scan(x, 2, '"'); /* with count 2, SCAN returns the second token working
                              left to right; a negative count works right to left. The double quote is the
                              delimiter here, so this extracts the href value */
 
                              if not missing(temp) then
                                    do;
                                          uri=cats('http://delcode.delaware.gov/sessionlaws/ga148/', temp);
                                                *CATS concatenates its arguments after stripping leading and trailing blanks;
                                          /* Write out the Filename Statement */
                                          PUT '%chapter(url=%str(' uri '));';
                                          output _urls;
                                    end;
                        end;
            end;
run;
 
options source source2 mprint;
%include chptrs / lrecl=400;
 
filename _all_ clear;
 

Thanks for your help,

4 Replies
ballardw
Super User

You may need to be a bit more detailed with your question.

Does the program example run for the given site and get the expected, or at least usable, data?

Are you asking how to modify this program to access other sites? With different keywords?

Please note that attempting to read PDF files is likely to be a less-than-joyous experience. So are you attempting to download PDFs? Or HTML?

 

Your first other URL shows a page that implies it is expecting some kind of query, so you likely need to change the URL, but I have no clue to what.

kevin12
Fluorite | Level 6

The program works for the website that is currently in it. After running the program, you can see that each chapter comes up with the words you are looking for. I would like to access other websites with the same keywords. The keywords can be changed, but the problem is that whenever the website is changed, the program no longer works. It is mainly HTML. I already have a program that scans PDFs.

ChrisHemedinger
Community Manager

I've published some general guidance about scraping data from web pages with SAS in this blog post.

 

While your program is good and works well with the one style of page that you designed it for, it's a big challenge to build something that works for every web site out there.  The diversity of web pages and how they are produced (HTML, Javascript, DIV tags vs TABLE tags, etc.) is immense.

 

Others have written papers on the topic:

SAS Text Miner (as @Patrick mentioned) has a built-in capability for crawling web sites with the %TMFILTER macro - and is designed to be more robust, with safeguards for performance and web-crawling etiquette.

Patrick
Opal | Level 21

Ideally you'd have the SAS Text Analytics bundle licensed as this would give you everything you need (and more).

 

I'm sure there are ways to do everything in Foundation SAS (possibly by calling some 3rd-party tools like Tika from SAS), but I'd assume it's going to cost you a lot of effort to get it right, and every change to your sources will cause you a lot of additional work.

 

If you don't have access to SAS Text Analytics, or at least some of its sub-components like Web Crawler, then consider looking into Python for at least the data retrieval and data prep part of your task.

 

Python is an open-source programming language which integrates quite well with SAS (and it will integrate even better in future releases).
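As a rough illustration of that route, here is a minimal Python sketch (standard library only; the sample HTML, the function name, and the keyword list are placeholders of mine, not part of the original program) that mirrors the logic of the SAS data step above: wait for the <body> tag, strip HTML tags, then collect keyword matches.

```python
import re

# Keywords mirroring the SAS PRXPARSE pattern (case-insensitive)
KEYWORDS = re.compile(r'(DISTRICT|SCHOOL|PUBLIC|BOARD|DRAINAGE)', re.IGNORECASE)
# Same tag-stripping idea as the SAS PRXCHANGE call
TAGS = re.compile(r'<[^<>]+>')

def find_keywords(html_text):
    """Return every keyword hit found from the <body> tag onward,
    like the SAS data step's flag/PRXNEXT loop."""
    hits = []
    in_body = False
    for line in html_text.splitlines():
        if line.lstrip().lower().startswith('<body>'):
            in_body = True
        if not in_body:
            continue
        text = TAGS.sub('', line)  # strip the HTML tags from the line
        hits.extend(m.group(0) for m in KEYWORDS.finditer(text))
    return hits

sample = """<html><head><title>ignored TITLE SCHOOL</title></head>
<body>
<p>AN ACT RELATING TO THE SCHOOL DISTRICT BOARD</p>
</body></html>"""

print(find_keywords(sample))  # ['SCHOOL', 'DISTRICT', 'BOARD']
```

For real pages you would fetch the HTML first (e.g. with the standard-library urllib.request, or the third-party requests package) and be aware that, just as with the SAS version, sites built from Javascript or unusual markup will need different handling.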

