kaziumair
Quartz | Level 8

Hi everyone,

I want to extract the names of people from an article. Some of the names start with a title and some do not.

I am using prxparse and prxnext to find the names, and I am partially successful: the names are found, but, as expected, other text that matches the pattern is extracted as well. Can you please suggest a way to find only the names, with or without the title?

In the code below, I am trying to find names without any title.

filename source "location/source.txt";

proc http
	method="get"
	url="https://amabhungane.org/stories/210701-vbs-indictment-details-corrupt-gratifications-driving-illegal-municipal-investments-in-the-doomed-bank/"
	out=source;
run;
/* Keep only the lines that contain a <p> tag and strip the HTML tags from them. */
data work.rep(drop=line linenum rx1);
	infile source length=len lrecl=32767;
	input line $varying32767. len;
	line = strip(line);
	linenum = _n_;
	retain rx1;
	if _n_ = 1 then rx1 = prxparse("s/<.*?>//"); /* compile the pattern once */
	if len > 0;
	string = line;
	if find(line, '<p>') gt 0 then do;
		call prxchange(rx1, -1, string);
		output;
	end;
run;
proc transpose data=work.rep out=work.rep_t;
	var string;
run;
data work.extracted_para;
	length paragraph $ 5000 ;
	set work.rep_t;
	/* PROC TRANSPOSE names its output columns COL1, COL2, ...; "of col:" picks up all of them. */
	paragraph = catx(". ", of col:);
	keep paragraph;
run;
data extract;
	set work.extracted_para;
	start_pos = 1;
	stop_pos = length(paragraph);
	pattern_pos = prxparse("/ [A-Z]{1}\w+\s[A-Z]{1}\w+ /");
	call prxnext(pattern_pos, start_pos, stop_pos, paragraph, position, length);
	do while (position > 0);
		name = substr(paragraph, position, length);
		output;
		call prxnext(pattern_pos, start_pos, stop_pos, paragraph, position, length);
	end;
run;

 

2 REPLIES
Reeza
Super User
How are names uniquely identified in your text? Can you include some sample data?

If there's no way to differentiate between a reference to John and john, or to tell whether Apple or Blue is a valid name, you're going to have a margin of error.

I find Google's APIs relatively good at this. Do you have access to SAS EM with the text capabilities?
kaziumair
Quartz | Level 8
Hi, in the article the names are in proper case, for example David Beckham. Some names are preceded by titles, for example minister, president, officer, etc.
I do not have access to SAS EM, but I do have access to SAS Viya.
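
Given the clarification above (names in proper case, sometimes preceded by a title), one possible approach is to make the title an optional capture group and to require each name word to be a capital letter followed by lowercase letters. The following is only a minimal sketch, not a complete solution: the title list, the sample sentence, and the names work.names, rx, match_pos, match_len and has_title are illustrative assumptions, and John Smith and Jane Doe are made-up names.

```sas
/* Sketch only: the title list and the sample text are assumptions. */
data work.names(keep=title name has_title);
	length text $ 400 title $ 20 name $ 100;
	retain rx;
	/* Group 1: optional title from a fixed list (first letter in either case). */
	/* Group 2: two or more consecutive proper-case words.                      */
	if _n_ = 1 then rx = prxparse(
		'/\b(?:([Mm]inister|[Pp]resident|[Oo]fficer)\s+)?([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)\b/');

	text = 'The report says minister John Smith met David Beckham and President Jane Doe.';
	start_pos = 1;
	stop_pos = length(text);

	call prxnext(rx, start_pos, stop_pos, text, match_pos, match_len);
	do while (match_pos > 0);
		title = prxposn(rx, 1, text);   /* blank when no title preceded the name */
		name = prxposn(rx, 2, text);
		has_title = not missing(title);
		output;
		call prxnext(rx, start_pos, stop_pos, text, match_pos, match_len);
	end;
run;
```

For the sample sentence, the intent is one row each for John Smith, David Beckham and Jane Doe, with has_title set to 1 only for the names that follow a listed title. Any other run of proper-case words, such as a place name, still matches this pattern, and a title missing from the list is captured as part of the name. Keeping only rows where has_title = 1 trades recall for precision; beyond that, a regular expression alone cannot tell names from other capitalized phrases, which is the margin of error mentioned in the first reply.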
