Hi,
I have a column that contains sentences in where I want to extract a specific keyword from and create a new column containing the keyword. Please see below:
Sentence | Keyword Desired |
Firm A helps Midea to successfully complete its takeover of KUKA | Midea |
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines | Volkswagen |
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech | The Carlyle Group |
The sentences are written quite similar and most of the keywords I want fall immediately after "advises" or "helps". Is there are a way of obtaining the keywords automatically after those identifiers?
Many Thanks
Chris
Something like below could work.
data have;
infile cards truncover;
input have_str $200.;
datalines;
Firm A helps Midea to successfully complete its takeover of KUKA
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;
data want(drop=_:);
set have;
length want $100;
retain _prxid;
if _n_=1 then
_prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
_start=1;
_stop=length(have_str);
call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
do while (_pos > 0);
want=catx('|',want,prxposn(_prxid, 2, have_str));
call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
end;
run;
proc print data=want;
run;
There is, the question is going to be how to know when to stop taking words after helps/advises. Anyway to start:
data want; set have; if index(sentance,"helps") > 0 then after=substr(sentance,index(sentance,"helps") + 6); run;
This will take everything in the string after the word helps. You can do the same for advises. But I don't see any logical way of knowing if there should be 1,2,3, etc words after helps to be saved?
Are you lucky enough that
Then you could find the location of "Firm A" in your sentence, and then scan the remainder looking for words with initial letters capitalized:
data want (drop=_:);
set have;
_ix = indexw(sentence,"Firm A"); /* Find "Firm A" in the sentence */
_mod_string=substr(sentence,_ix+length("Firm A")); /* Get remainder of the sentence */
length key_word $30;
/* Extract each word and see if it starts with a capital letter */
/* Once non-capitalized word is encountered and key_word is not blank, leave the loop */
do _w=1 to countw(_mod_string);
_test_word=scan(_mod_string,_w,'');
if char(_test_word,1)=upcase(char(_test_word,1)) then key_word=catx(' ',key_word,_test_word);
else if key_word^=' ' then leave;
end;
run;
If you have a list of companies of interest and all the likely permutations of the spelling of the names you may be better off using a search to indicate found words.
Something that is unreliable but may help with some of the multi-word names is to keep sequential words (append in a single variable) that start with uppercase letters. All those names with lower case words common in names such as "of" "and" would fail though you might get enough
How many records are you going to be searching?
There are around 40k records that I would be screening
Here can give you a start.
data have;
infile cards truncover;
input x $200.;
cards;
Firm A helps Midea to successfully complete its takeover of KUKA
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;
data want;
set have;
temp=substr(x,prxmatch('/helps|advises/i',x));
want=catx(' ',scan(temp,2),scan(temp,3),scan(temp,4));
drop temp;
run;
The following code uses a Regular Expression which captures the first set of words starting with a capital after terms help or advise.
data have;
infile cards truncover;
input have_str $200.;
cards;
Firm A helps Midea to successfully complete its takeover of KUKA
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;
data want;
set have;
length want $100;
retain _prxid;
if _n_=1 then
_prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
if prxmatch(_prxid, have_str) then
do;
want=prxposn(_prxid, 2, have_str);
end;;
run;
proc print data=want;
run;
Hi,
If the raw data looked like the below;
"Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech"
how would you create two want values and pick up both Volkswagen & The Carlyle Group?
Thanks
Chris
Is it always Firm A? Is this a term we can use as keyword for a search?
Hi,
No it's not always Firm A. The text itself could start with one of the 3 keywords such as advised. I used Firm A to start the sentence and is therefore not a keyword in the search.
Thanks
Chris
Something like below could work.
data have;
infile cards truncover;
input have_str $200.;
datalines;
Firm A helps Midea to successfully complete its takeover of KUKA
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;
data want(drop=_:);
set have;
length want $100;
retain _prxid;
if _n_=1 then
_prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
_start=1;
_stop=length(have_str);
call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
do while (_pos > 0);
want=catx('|',want,prxposn(_prxid, 2, have_str));
call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
end;
run;
proc print data=want;
run;
Or:
data WANT;
set HAVE;
KEYWORD=prxchange('s/.*?\b(help|advise)[^A-Z]{0,10}((\b[A-Z]\S+ )+).*/$2/o',1,TXT);
put KEYWORD=;
run;
KEYWORD=Midea
KEYWORD=Volkswagen
KEYWORD=The Carlyle Group
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.