Hi everyone,
I want to extract the names of people from an article. Some of the names start with a title and some do not.
I am using prxparse and prxnext to find the names, and I am partially successful: the names are extracted, but, as expected, other text that matches the pattern is extracted as well. Can you please suggest a way to find only the names, with or without a title?
In the code below I am trying to find names without any title.
filename source "location/source.txt";
proc http
    method="get"
    url="https://amabhungane.org/stories/210701-vbs-indictment-details-corrupt-gratifications-driving-illegal-municipal-investments-in-the-doomed-bank/"
    out=source;
run;
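data work.rep(drop=line linenum rx1);
    infile source length=len lrecl=32767;
    input line $varying32767. len;
    line = strip(line);
    linenum = _n_;
    retain rx1;
    /* Compile the tag-stripping pattern once, on the first iteration */
    if _n_ = 1 then rx1 = prxparse("s/<.*?>//");
    if len > 0;
    string = line;
    /* Keep only lines that contain a <p> tag, with all HTML tags removed */
    if find(line, '<p>') gt 0 then do;
        call prxchange(rx1, -1, string);
        output;
    end;
run;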
proc transpose data=work.rep out=work.rep_t;
    var string;
run;
data work.extracted_para;
    length paragraph $ 5000;
    set work.rep_t;
    /* PROC TRANSPOSE names the transposed columns col1, col2, ... */
    paragraph = catx(". ", col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20);
    keep paragraph;
run;
data extract;
    set work.extracted_para;
    start_pos = 1;
    stop_pos = length(paragraph);
    /* Two capitalised words separated by whitespace, with a space on either side */
    pattern_pos = prxparse("/ [A-Z]{1}\w+\s[A-Z]{1}\w+ /");
    call prxnext(pattern_pos, start_pos, stop_pos, paragraph, position, length);
    /* Write one row per match until prxnext finds nothing more */
    do while (position > 0);
        name = substr(paragraph, position, length);
        output;
        call prxnext(pattern_pos, start_pos, stop_pos, paragraph, position, length);
    end;
run;
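To make the "with or without a title" part concrete, here is a minimal sketch of one possible direction, not a tested solution. It assumes the titles of interest are limited to Mr, Mrs, Ms, Dr and Prof (that list, the sample sentence, and the names work.demo_names, pattern_id, match_pos and match_len are all made up for illustration). The pattern uses \b word boundaries instead of literal spaces, so names next to punctuation can still match, and an optional non-capturing group for the title. It will still pick up other capitalised word pairs such as place names, so it does not by itself solve the problem of unwanted matches.

```sas
data work.demo_names;
    length text $ 200 name $ 60;
    /* Hypothetical sample sentence, used only to exercise the pattern */
    text = "The court heard that Dr. Jane Doe and John Smith met Mr Sam Brown in Pretoria.";

    /* Optional title (assumed list), then two or more capitalised words */
    pattern_id = prxparse("/\b(?:(?:Mr|Mrs|Ms|Dr|Prof)\.?\s+)?[A-Z]\w+(?:\s+[A-Z]\w+)+\b/");

    start_pos = 1;
    stop_pos = length(text);

    /* Walk through every match in the sentence, one output row per match */
    call prxnext(pattern_id, start_pos, stop_pos, text, match_pos, match_len);
    do while (match_pos > 0);
        name = substr(text, match_pos, match_len);
        output;
        call prxnext(pattern_id, start_pos, stop_pos, text, match_pos, match_len);
    end;

    keep name;
run;

proc print data=work.demo_names;
run;
```

To apply the same idea to the real data, the prxparse pattern in the extract step above could be swapped for this one, with the title list adjusted to whatever titles actually appear in the article.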