BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cmoore
Obsidian | Level 7

Hi,

 

I have a column that contains sentences in where I want to extract a specific keyword from and create a new column containing the keyword. Please see below:

 

SentenceKeyword Desired
Firm A helps Midea to successfully complete its takeover of KUKAMidea
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel enginesVolkswagen
Firm A advises The Carlyle Group on its $3.2bn acquisition of AtotechThe Carlyle Group

 

The sentences are written quite similar and most of the keywords I want fall immediately after "advises" or "helps". Is there are a way of obtaining the keywords automatically after those identifiers?

 

Many Thanks

Chris

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@cmoore

Something like below could work.

data have;
  infile cards truncover;
  input have_str $200.;
  datalines;
Firm A helps Midea to successfully complete its takeover of KUKA  
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines 
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want(drop=_:);
  set have;
  length want $100;
  retain _prxid;

  if _n_=1 then
    _prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
  _start=1;
  _stop=length(have_str);

  call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  do while (_pos > 0);
    want=catx('|',want,prxposn(_prxid, 2, have_str));
    call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  end;

run;

proc print data=want;
run;

View solution in original post

11 REPLIES 11
RW9
Diamond | Level 26 RW9
Diamond | Level 26

There is, the question is going to be how to know when to stop taking words after helps/advises.   Anyway to start:

data want;
  set have;
  if index(sentance,"helps") > 0 then after=substr(sentance,index(sentance,"helps") + 6);
run;

This will take everything in the string after the word helps.  You can do the same for advises.  But I don't see any logical way of knowing if there should be 1,2,3, etc words after helps to be saved?

mkeintz
PROC Star

Are you lucky enough that

  1. All your sentences have the term "Firm A" preceding (by some number of words) the desired keyword(s)?
  2. The desired keyword(s) will always be the first set of words (after "Firm A") with initial letters capitalized?

 

 Then you could find the location of "Firm A" in your sentence, and then scan the remainder looking for words with initial letters capitalized:

data want (drop=_:);
  set have;
  _ix = indexw(sentence,"Firm A");                   /* Find "Firm A" in the sentence */
  _mod_string=substr(sentence,_ix+length("Firm A")); /* Get remainder of the sentence */

  length key_word $30;
  /* Extract each word and see if it starts with a capital letter */
  /* Once non-capitalized word is encountered and key_word is not blank, leave the loop */
  do _w=1 to countw(_mod_string);
    _test_word=scan(_mod_string,_w,'');  
	if char(_test_word,1)=upcase(char(_test_word,1)) then key_word=catx(' ',key_word,_test_word);
	else if key_word^=' ' then leave;
  end;    
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

If you have a list of companies of interest and all the likely permutations of the spelling of the names you may be better off using a search to indicate found words.

 

Something that is unreliable but may help with some of the multi-word names is to keep sequential words (append in a single variable) that start with uppercase letters. All those names with lower case words common in names such as "of" "and" would fail though you might get enough

 

How many records are you going to be searching?

cmoore
Obsidian | Level 7

There are around 40k records that I would be screening

Ksharp
Super User

Here can give you a start.

 

data have;
infile cards truncover;
input x $200.;
cards;
Firm A helps Midea to successfully complete its takeover of KUKA	
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines	
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want;
 set have;
 temp=substr(x,prxmatch('/helps|advises/i',x));
 want=catx(' ',scan(temp,2),scan(temp,3),scan(temp,4));
 drop temp;
run;
Patrick
Opal | Level 21

The following code uses a Regular Expression which captures the first set of words starting with a capital after terms help or advise.

data have;
  infile cards truncover;
  input have_str $200.;
  cards;
Firm A helps Midea to successfully complete its takeover of KUKA  
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines 
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want;
  set have;
  length want $100;
  retain _prxid;

  if _n_=1 then
    _prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');

  if prxmatch(_prxid, have_str) then
    do;
      want=prxposn(_prxid, 2, have_str);
    end;;
run;

proc print data=want;
run;

 

cmoore
Obsidian | Level 7

Hi,

 

If the raw data looked like the below;

 

"Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech"

 

how would you create two want values and pick up both Volkswagen &  The Carlyle Group?

 

Thanks

Chris

 

Patrick
Opal | Level 21

@cmoore

Is it always Firm A? Is this a term we can use as keyword for a search?

cmoore
Obsidian | Level 7

Hi,

 

No it's not always Firm A. The text itself could start with one of the 3 keywords such as advised. I used Firm A to start the sentence and is therefore not a keyword in the search.

 

Thanks

Chris

Patrick
Opal | Level 21

@cmoore

Something like below could work.

data have;
  infile cards truncover;
  input have_str $200.;
  datalines;
Firm A helps Midea to successfully complete its takeover of KUKA  
Firm A advised on the Volkswagen settlement with US authorities regarding emissions from diesel engines 
Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
Firm A advises on the Volkswagen settlement with US authorities regarding emissions from diesel engines Firm A advises The Carlyle Group on its $3.2bn acquisition of Atotech
;
run;

data want(drop=_:);
  set have;
  length want $100;
  retain _prxid;

  if _n_=1 then
    _prxid=prxparse('/(\bhelp|\badvise)[^A-Z]{0,10}(([A-Z]\w+\s*\b)+)/');
  _start=1;
  _stop=length(have_str);

  call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  do while (_pos > 0);
    want=catx('|',want,prxposn(_prxid, 2, have_str));
    call prxnext(_prxid, _start, _stop, have_str, _pos, _len);
  end;

run;

proc print data=want;
run;
ChrisNZ
Tourmaline | Level 20

Or:


data WANT;
  set HAVE;
  KEYWORD=prxchange('s/.*?\b(help|advise)[^A-Z]{0,10}((\b[A-Z]\S+ )+).*/$2/o',1,TXT);
  put KEYWORD=;
run;

KEYWORD=Midea
KEYWORD=Volkswagen
KEYWORD=The Carlyle Group

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 875 views
  • 0 likes
  • 7 in conversation