kaziumair
Quartz | Level 8

Hi everyone, I am trying to extract article titles from the following URL. There are a total of 20 titles on this page. I am using PROC HTTP to fetch the page. When I run the DATA step to extract the titles, only 4 of the 20 are extracted. After checking the text file referenced in PROC HTTP, I found that its contents are being truncated when the file is imported. This might be due to the maximum length, but I am not sure. Please suggest a way to overcome this problem and extract all the titles.

%let url = %nrstr(https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=1);
filename dest "location/extract.txt";
proc http
  url = "&url."
  out = dest
  method = "GET";
run;

data links;
  infile dest length = recLen;
  input line $varying32767. recLen;
  line = strip(line);
  len = length(line);
  start_pos = 1;
  stop_pos = length(line);
  pattern_pos = prxparse("/title=/");
  call prxnext(pattern_pos, start_pos, stop_pos, line, position, length);
  do while (position > 0);
    titles = compress(scan(substr(line, position), 2, '"'), '\');
    output;
    call prxnext(pattern_pos, start_pos, stop_pos, line, position, length);
  end;
  keep titles;
run;

 

Accepted Solution

Tom
Super User

Are you trying to parse HTML code?

 

A SAS character variable can hold at most 32,767 bytes (32K), so pulling the whole page into one variable and parsing it with a regular expression is not the way to do it. If you want to use regular expressions, do it in some external process and then have SAS read those results.
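To illustrate the limit, here is a minimal sketch (the data set and variable names are made up for this example). It builds 40,000 bytes of text from two 20,000-byte pieces and tries to store the result in the longest character variable SAS allows. The same ceiling applies to the line variable in the question. In recent releases the default LRECL on INFILE is also 32,767, so a longer record can be cut off before any parsing happens.

```sas
/* Sketch only: CHECK, A, B, BIG and KEPT are hypothetical names. */
data check;
  length a b $20000 big $32767;   /* 32,767 is the largest length a character variable can have */
  a = repeat('a', 19999);         /* REPEAT(x, n) returns n+1 copies, so 20,000 bytes */
  b = repeat('b', 19999);
  big = a || b;                   /* 40,000 bytes cannot all fit into BIG */
  kept = length(big);             /* number of bytes that survived the assignment */
  put kept=;
  keep kept;
run;
```

Whatever KEPT reports in the log, it cannot exceed 32,767, which is why a page longer than that loses its tail when it is read into a single variable.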

 

If you want to do it with SAS then use the functionality of the SAS INPUT statement to help.

Perhaps something like:


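/* Single quotes keep the macro processor from treating the ampersands in the URL as macro references. */
%let url = 'https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=1';
filename dest temp;
proc http
  url = &url
  out = dest
  method = "GET"
;
run;

data test ;
  /* A large LRECL lets SAS read very long lines without truncating them. */
  infile dest lrecl=9000000 ;
  linkno+1;
  /* Jump past each article-title anchor and its href=, read the value that follows,
     and hold the line (@@) so the next iteration continues from the same position. */
  input @'<a class=article-title' @'href=' title & :$300. @@;
  /* Turn each escaped quote into a plain quote, then strip the surrounding quotes. */
  title=dequote(tranwrd(title,'\"','"'));
run;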

proc print;
run;

Results:

Obs    linkno    title

  1       1      /bd/politics/2021-08-06-podcast-unpacking-the-reshuffle-and-what-is-means-for-sa-and-its-economy/
  2       2      /bd/politics/2021-08-06-parties-and-unions-in-mixed-reaction-to-ramaphosas-reshuffle/
  3       3      /bd/politics/2021-08-03-exclusive-no-ministers-are-safe-as-ramaphosa-prepares-to-wield-axe/
  4       4      /bd/politics/2021-08-03-new-contenders-emerge-for-posts-in-reshuffled-cabinet/
  5       5      /bd/national/2021-08-01-anc-vows-to-go-it-alone-after-disagreeing-with-eff-on-land-expropriation/
  6       6      /bd/politics/2021-08-01-political-week-ahead-da-calls-for-transparent-investigation-on-unrest/
  7       7      /bd/politics/2021-07-25-political-week-ahead-unrest-fallout-to-remain-top-of-agenda/
  8       8      /bd/politics/2021-07-18-political-week-ahead-jacob-zumas-court-sagas-likely-to-dominate-the-agenda/
  9       9      /bd/politics/2021-07-16-watch-busa-calls-for-24-hour-curfew/
 10      10      /bd/politics/2021-07-15-watch-who-or-what-is-behind-the-riots/
 11      11      /bd/politics/2021-07-15-watch-how-actionsa-plans-to-file-a-lawsuit-against-cabinet-and-anc/
 12      12      /bd/politics/2021-07-14-watch-unpacking-the-powder-keg/
 13      13      /bd/politics/2021-07-11-political-week-ahead-jacob-zuma-back-at-top-court-to-appeal-contempt-charges/
 14      14      /bd/national/2021-07-09-its-politics-versus-the-law-magashule-says-after-high-court-defeat/
 15      15      /bd/politics/2021-07-07-jacob-zuma-will-not-go-to-jail-on-wednesday-night-son-edward-says/
 16      16      /bd/politics/2021-07-07-sa-electoral-reform-could-lead-to-a-bigger-anc/
 17      17      /bd/politics/2021-07-06-anc-distances-itself-from-zumas-attacks-on-court/
 18      18      /bd/politics/2021-07-04-political-week-ahead-zuma-drama-takes-centre-stage/
 19      19      /bd/national/2021-07-02-anc-scraps-nec-meeting-over-zumas-surrender-and-feared-violence-in-kzn/
 20      20      /bd/opinion/columnists/2021-06-28-carol-paton-prospects-of-anc-eff-truce-over-land-fade-but-problems-remain/
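For readers who want to try the INPUT technique without calling the website, here is a small self-contained sketch. The sample HTML, the fileref DEMO and the data set LINKS are all hypothetical. It is a simplified variant of the program above: it leaves out the & modifier and the TRANWRD step because the sample has no escaped quotes and each href value is followed by a blank-delimited boundary.

```sas
/* Sketch only: DEMO, LINKS and the sample HTML are hypothetical. */
filename demo temp;

/* Write one line that holds three anchors; only two have class=article-title. */
data _null_;
  file demo lrecl=1000;
  put '<div><a class=article-title href="/a/one/">One</a> '
      '<a class=other href="/skip/">Skip</a> '
      '<a class=article-title href="/a/two/">Two</a></div>';
run;

data links;
  infile demo lrecl=1000;
  linkno + 1;
  /* @'text' moves the pointer to just after the next occurrence of the text.
     The trailing @@ holds the line so the next iteration resumes from there. */
  input @'<a class=article-title' @'href=' link :$300. @@;
  /* DEQUOTE keeps the quoted value and drops whatever follows the closing quote. */
  link = dequote(link);
run;

proc print data=links;
run;
```

The anchor with class=other is never matched by the search string, so it should not show up in LINKS. Because the parsing happens on the input buffer rather than in a character variable, the 32,767-byte variable limit does not get in the way.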


6 REPLIES
ballardw
Super User

When I tried to run your code (I moved the URL into the PROC HTTP statement to avoid any macro resolution issues), I got this in the log for PROC HTTP:

58   proc http
59      url = "https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=1"
WARNING: Apparent symbolic reference PARTIAL not resolved.
WARNING: Apparent symbolic reference PAGE not resolved.
60      out = dest
61      method = "GET" ;
62   run;

When I removed the partial and page parameters, I got a lot more text.

99   proc http
100     url = "https://www.businesslive.co.za/bd/politics/?limit=10"
101     out = dest
102     method = "GET" ;
103  run;

Your logic pulled out 23 "titles". I suspect they aren't all the ones that you want, as some look like clickbait from the website.
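A note on the warnings in the first log: they come from the macro processor, which scans text inside double quotes for macro references. When a reference cannot be resolved, the text is left as it is and a warning is written; the URL would only change if macro variables named PARTIAL or PAGE happened to exist in the session. The sketch below compares three ways of writing the same URL. It assumes no such macro variables are defined and makes no HTTP call.

```sas
/* Double quotes: the text is scanned, so the partial and page parameters
   produce "Apparent symbolic reference ... not resolved" warnings. */
data _null_;
  url = "https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=1";
  put url=;
run;

/* Single quotes: the text is not scanned for macro references. */
data _null_;
  url = 'https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=1';
  put url=;
run;

/* %NRSTR: masks the ampersands when the macro variable is created. */
%let url = %nrstr(https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=1);
%put &=url;
```

Comparing the three values written to the log shows whether the query string was altered in your own session.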

 

Another consideration when it comes to truncation: if you do not set a specific length for the Titles variable, its first use sets the length, and that may not be the length you want. The initial program wanted to set Titles to 32K characters, which is a bit excessive.
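As a small illustration of that point, here is a sketch with made-up sample text and a hypothetical length of 300. Declaring the variable in a LENGTH statement before the first assignment fixes its size, instead of letting the first expression decide.

```sas
/* Sketch only: DEMO, the sample line and the length of 300 are hypothetical. */
data demo;
  length titles $300;                 /* declared before the first assignment */
  line = 'title="First headline" title="Second headline"';
  titles = scan(line, 2, '"');        /* second piece when splitting on double quotes */
  titles_len = vlength(titles);       /* defined length of TITLES in bytes */
  put titles= titles_len=;
run;
```

Removing the LENGTH statement and rerunning shows the length SAS would otherwise pick from the assignment.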

kaziumair
Quartz | Level 8
Hi, actually the website has many pages and I have to scrape the entire website, so I need pagination in order to access all the pages. I used the %NRSTR function, so there was no problem with macro resolution. With the link you used I do not get the expected results; there are a lot of duplicates.
ballardw
Super User

The warning I mentioned about PAGE means that the parameter is likely not being used by PROC HTTP, so I am not sure what to do about that.

I went to the website and could not decide which "titles" you wanted. I suspected a subset, but the way your code parses the values is why there would be duplicates.

The initial question was about the text file: "its contents are being truncated". I did my best to provide a solution to that, i.e. removing elements that PROC HTTP does not understand. Perhaps you want to revisit the text file and your code to parse it. Maybe there is something else that identifies the title= values that you don't want.
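On the pagination requirement, one possible approach is a macro loop that requests one page per iteration and appends the parsed links. This is only a sketch: the macro name, the LAST_PAGE parameter, the data set names and the assumption that the site accepts increasing page numbers are all hypothetical, and the parsing step is copied from the accepted solution.

```sas
/* Sketch only: FETCH_PAGES, LAST_PAGE, ONE_PAGE and ALL_LINKS are hypothetical names. */
%macro fetch_pages(last_page=3);
  %local page base;
  /* %NRSTR masks the ampersands so the query string is not read as macro references. */
  %let base = %nrstr(https://www.businesslive.co.za/bd/politics/?limit=10&partial=true&page=);

  %do page = 1 %to &last_page;
    filename dest temp;

    proc http
      url = "&base.&page"
      out = dest
      method = "GET";
    run;

    data one_page;
      infile dest lrecl=9000000;
      length title $300;
      page = &page;                   /* remember which page the link came from */
      input @'<a class=article-title' @'href=' title & :$300. @@;
      title = dequote(tranwrd(title, '\"', '"'));
    run;

    /* The first iteration creates ALL_LINKS; later ones add to it. */
    proc append base=all_links data=one_page force;
    run;

    filename dest clear;
  %end;
%mend fetch_pages;

%fetch_pages(last_page=3)
```

ALL_LINKS keeps growing across runs, so delete it before rerunning. Duplicates across pages, if any, could be removed afterwards with PROC SORT and the NODUPKEY option.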

kaziumair
Quartz | Level 8

Hi, thank you for taking the time to help me with my query.


kaziumair
Quartz | Level 8

Hi, yes, I am trying to parse HTML code. Thank you for your help; this is exactly what I wanted.
