Restricting keywords

Accepted Solution Solved
Reply
Contributor
Posts: 32
Accepted Solution

Restricting keywords

Hi,

 

I have a column that contains sentences in where I want to extract a specific keyword from and create a new column containing the keyword. Please see below:

 

SentenceKeyword Desired
Firm A helps Midea to successfully complete its takeover of KUKAMidea
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel enginesVolkswagen
Firm A advises The Carlyle Group on its $3.2bn acquisition of AtotechThe Carlyle Group

 

The sentences are written quite similar and most of the keywords I want fall immediately after "advises" or "helps". Is there are a way of obtaining the keywords automatically after those identifiers?

 

Many Thanks

Chris


Accepted Solutions
Solution
‎07-19-2017 06:59 AM
Respected Advisor
Posts: 3,893

Re: Restricting keywords

@cmoore

Something like below could work.

data have;
  infile cards truncover;
  input have_str $200.;
  datalines;
Firm A helps Midea to successfully complete its takeover of KUKA  
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines 
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want(drop=_:);
  set have;
  length want $100;
  retain _prxid;

  if _n_=1 then
    _prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
  _start=1;
  _stop=length(have_str);

  call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  do while (_pos > 0);
    want=catx('|',want,prxposn(_prxid, 2, have_str));
    call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  end;

run;

proc print data=want;
run;

View solution in original post


All Replies
Super User
Super User
Posts: 7,401

Re: Restricting keywords

There is, the question is going to be how to know when to stop taking words after helps/advises.   Anyway to start:

data want;
  set have;
  if index(sentance,"helps") > 0 then after=substr(sentance,index(sentance,"helps") + 6);
run;

This will take everything in the string after the word helps.  You can do the same for advises.  But I don't see any logical way of knowing if there should be 1,2,3, etc words after helps to be saved?

Valued Guide
Posts: 797

Re: Restricting keywords

Are you lucky enough that

  1. All your sentences have the term "Firm A" preceding (by some number of words) the desired keyword(s)?
  2. The desired keyword(s) will always be the first set of words (after "Firm A") with initial letters capitalized?

 

 Then you could find the location of "Firm A" in your sentence, and then scan the remainder looking for words with initial letters capitalized:

data want (drop=_:);
  set have;
  _ix = indexw(sentence,"Firm A");                   /* Find "Firm A" in the sentence */
  _mod_string=substr(sentence,_ix+length("Firm A")); /* Get remainder of the sentence */

  length key_word $30;
  /* Extract each word and see if it starts with a capital letter */
  /* Once non-capitalized word is encountered and key_word is not blank, leave the loop */
  do _w=1 to countw(_mod_string);
    _test_word=scan(_mod_string,_w,'');  
	if char(_test_word,1)=upcase(char(_test_word,1)) then key_word=catx(' ',key_word,_test_word);
	else if key_word^=' ' then leave;
  end;    
run;

 

 

Super User
Posts: 10,500

Re: Restricting keywords

If you have a list of companies of interest and all the likely permutations of the spelling of the names you may be better off using a search to indicate found words.

 

Something that is unreliable but may help with some of the multi-word names is to keep sequential words (append in a single variable) that start with uppercase letters. All those names with lower case words common in names such as "of" "and" would fail though you might get enough

 

How many records are you going to be searching?

Contributor
Posts: 32

Re: Restricting keywords

There are around 40k records that I would be screening

Super User
Posts: 9,681

Re: Restricting keywords

Here can give you a start.

 

data have;
infile cards truncover;
input x $200.;
cards;
Firm A helps Midea to successfully complete its takeover of KUKA	
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines	
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want;
 set have;
 temp=substr(x,prxmatch('/helps|advises/i',x));
 want=catx(' ',scan(temp,2),scan(temp,3),scan(temp,4));
 drop temp;
run;
Respected Advisor
Posts: 3,893

Re: Restricting keywords

[ Edited ]

The following code uses a Regular Expression which captures the first set of words starting with a capital after terms help or advise.

data have;
  infile cards truncover;
  input have_str $200.;
  cards;
Firm A helps Midea to successfully complete its takeover of KUKA  
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines 
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want;
  set have;
  length want $100;
  retain _prxid;

  if _n_=1 then
    _prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');

  if prxmatch(_prxid, have_str) then
    do;
      want=prxposn(_prxid, 2, have_str);
    end;;
run;

proc print data=want;
run;

 

Contributor
Posts: 32

Re: Restricting keywords

Hi,

 

If the raw data looked like the below;

 

"Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech"

 

how would you create two want values and pick up both Volkswagen &  The Carlyle Group?

 

Thanks

Chris

 

Respected Advisor
Posts: 3,893

Re: Restricting keywords

@cmoore

Is it always Firm A? Is this a term we can use as keyword for a search?

Contributor
Posts: 32

Re: Restricting keywords

Hi,

 

No it's not always Firm A. The text itself could start with one of the 3 keywords such as advised. I used Firm A to start the sentence and is therefore not a keyword in the search.

 

Thanks

Chris

Solution
‎07-19-2017 06:59 AM
Respected Advisor
Posts: 3,893

Re: Restricting keywords

@cmoore

Something like below could work.

data have;
  infile cards truncover;
  input have_str $200.;
  datalines;
Firm A helps Midea to successfully complete its takeover of KUKA  
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines 
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want(drop=_:);
  set have;
  length want $100;
  retain _prxid;

  if _n_=1 then
    _prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
  _start=1;
  _stop=length(have_str);

  call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  do while (_pos > 0);
    want=catx('|',want,prxposn(_prxid, 2, have_str));
    call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  end;

run;

proc print data=want;
run;
PROC Star
Posts: 1,561

Re: Restricting keywords

Or:


data WANT;
  set HAVE;
  KEYWORD=prxchange('s/.*?\b(help|advise)[^A-Z]{0,10}((\b[A-Z]\S+ )+).*/$2/o',1,TXT);
  put KEYWORD=;
run;

KEYWORD=Midea
KEYWORD=Volkswagen
KEYWORD=The Carlyle Group

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 199 views
  • 0 likes
  • 7 in conversation