Extract relevant information from the text

Frequent Contributor
Posts: 96

Hi All,

I want to extract company names from the biographies of individuals. In most cases, the company name is preceded by the word "of" or "at" and ends with "Inc", "Incorporated", "Corp", "Corporation" or "LLC". In some cases, the company names need to be extracted from a sentence like the following: "He serves as a Director of Symposium Corp., Dey & Co., Redirect, Inc., Flaghouse Communications, Greg Manning Auctions Inc., Websoft Systems, Inc. and Interlink Inc." That is, the company name is preceded by "," or "and".

I want to extract the company names from each biography and put them in separate columns, as in the example below. (Note: more than 10 companies can be extracted from this biography.)

Biography: Mr. Richard M. Cohen, MBA is the President of Richard M. Cohen Consultants Inc., since 1996 and serves as its Managing Principal. Mr. Cohen has been the Chief Financial Officer of CorMedix, Inc. since January 1, 2013 and serves as its Principal Accounting Officer. He serves as the Chief Executive Officer, Chief Financial Officer and Chief Accounting Officer of Websoft Systems, Inc. He serves as the Managing Director of Strauss Capital Partners LLC. Since 2002, he has been Managing Director of Encore/Novation. He served as an Interim Chief Executive Officer of CorMedix, Inc from 2011 to January 1, 2013 and Interim Chief Financial Officer from May 2, 2012 January 1, 2013. He served as Secretary of Dune Energy Inc. He served as the Chief Executive Officer, Chief Financial Officer and Principal Accounting Officer of Newtown Lane Marketing Incorporated. He served as the Chief Financial Officer of Cross Canyon Energy Corp. (formerly, ABC Funding Inc.) from April 2006 to January 2008 and also served as its Principal Accounting Officer. He served as the President of Pipeline Data, Inc. since January 2001 and served as its Treasurer. He served as the Chief Financial Officer, Principal Accounting Officer, Treasurer and Secretary of Pinpoint Recovery Solutions Corp. He served as the Chief Financial Officer of College Oak Investments Inc. since December 2005. Mr. Cohen served as Chief Financial Officer of Dune Energy Inc. from December 2003 to April 7, 2005 and Manager from April 7, 2005 to May 31, 2005. From 1993 to 1995. Mr. Cohen served as President of General Media Inc. from 1993 to 1995. He served as the Chief Financial Officer of Baseline Oil & Gas Corp., since December 27, 2005 and served as its Principal Accounting Officer. From 1988 to 1993, Mr. Cohen served as a Director of Investment Banking at Furman Selz Inc. In 1999, He served as the President of National Auto Credit, a publicly traded sub-prime auto finance company. From 1984 to 1992, Mr. Cohen served as an Investment Banker of Henry Ansbacher, Furman Selz, where he specialized in Mergers & Acquisitions, Public Equity Offerings, and Restructurings. From 1980 to 1983, he served as a Vice President of Corporate Development of Macmillan. He worked at Arthur Andersen & Co. from 1975 to 1977. He serves as the Executive Chairman of CorMedix, Inc. He has been the Chairman of Chord Advisors since 2012. Mr. Cohen has been a Director of CorMedix, Inc. since December 2009, Cross Media Marketing Corp. since October 1998, Helix Biomedix Inc. since December 14, 2005 and Pinpoint Recovery Solutions Corp. since March 2007. He has been a Director of China Filtration Technology, Inc. since June 2010 and China SLP Filtration Technology, Inc. since June 2010. He serves as a Director of Symposium Corp., Dey & Co., Redirect, Inc., Flaghouse Communications, Greg Manning Auctions Inc., Websoft Systems, Inc. and Interlink Inc. He served as a Director of Dune Energy Inc. from December 2003 to January 17, 2012, Ventrus Biosciences Inc. until November 10, 2010, Universal Travel Group from May 9, 2007 to June 23, 2008 and Direct Markets Holdings Corp. (formerly, Rodman & Renshaw Capital Group, Inc.) from August 2007 to August 21, 2012. He served as a Director of Immune Pharmaceuticals Ltd., and Newtown Lane Marketing Incorporated. He holds a Certified Public Accountant designation from the State of New York. Mr. Cohen received a B.S. cum laude from the Wharton School of Business at the University of Pennsylvania in 1973 and an M.B.A from Stanford University.
First name: Richard
Last name: Cohen
co1: Richard M. Cohen Consultants Inc
co2: CorMedix, Inc.
co3: Websoft Systems, Inc
co4: Strauss Capital Partners LLC.
co5: Encore/Novation
co6: Dune Energy Inc.
co7: Newtown Lane Marketing Incorporated.
co8: Canyon Energy Corp.
co9: Pipeline Data, Inc.
co10: College Oak Investments Inc.

Is it possible to do it in SAS? If yes, can someone please share the code with me?

Thank you for your help.


All Replies
Super Contributor
Posts: 644

Re: Extract relevant information from the text

Here are some suggestions to get you started

  1. Detect the name, titles and other honorifics, and retain this information for each subsequent company record.
  2. Create a separate record for each company, with the name of the person attached to each record.  This saves having to know how many companies are involved.
  3. Focus on words commencing with a capital letter.  They can be categorised as names, roles, company names, academic titles (MBA) and honorifics (Mr); or extraneous data like names of months or words that start a sentence.
  4. Make a list of Role words, and company termination words; and other lists of Months and common extraneous words to use in determining how to analyse each capitalised word.
  5. For each entry, extract the name, with title and honorifics if required.  Further occurrences of the name might be part of a company name or extraneous data for the purpose of this extraction.
  6. Continue until a Role word (President, Acting, Chief, etc) is encountered.  This should trigger a search for company name words, skipping over any further role words.  It may be useful to note whether 'of' or 'at' follows a role word.
  7. If a role word is followed by extraneous data such as dates or new sentence words, cancel the trigger.
  8. Otherwise store all words as parts of the company name until a company name termination word (Corp., Inc. etc) or an end of sentence is detected, or other extraneous words are found.
  9. Output the record for that company and continue the search for the next.
  10. At the end of the input record, start the process with the next name.

Note that some company names can begin with a lower case letter, which may cause problems, or may include symbols like &, or dates.
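
A stripped-down version of steps 2 and 8 might look like the sketch below. It is only an illustration under several assumptions: the person name, the one-sentence biography, the data set name `companies`, and both word lists are made up for the example, and the role-word and extraneous-word checks from steps 4, 6 and 7 are left out. A trigger word ("of" or "at") starts collecting words, and a company termination word ends the name and writes one record.

```sas
/* Hypothetical input: one short sentence and a hard-coded person name */
data companies;
  length person $50 company $200 word clean $100;
  retain person 'Richard Cohen';
  bio = 'He served as the Chief Financial Officer of Dune Energy Inc. from December 2003 to April 2005.';
  collecting = 0;
  do i = 1 to countw(bio, ' ');
    word  = scan(bio, i, ' ');
    clean = lowcase(compress(word, , 'ka'));      /* letters only, lower case */
    if collecting then do;
      company = catx(' ', company, word);         /* append the word to the name */
      /* illustrative list of company termination words */
      if clean in ('inc' 'incorporated' 'corp' 'corporation' 'llc') then do;
        output;                                   /* one record per company */
        call missing(company);
        collecting = 0;
      end;
    end;
    else if clean in ('of' 'at') then collecting = 1;   /* trigger word */
  end;
  keep person company;
run;
```

Because every company becomes its own record with the person attached, the number of companies per biography does not need to be known in advance. Without the role-word logic this sketch would also start collecting after an "of" that does not introduce a company, so it is a starting point rather than a finished extractor.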

Hope this helps

Richard

Super User
Posts: 9,681

Re: Extract relevant information from the text

OK. That is really not easy.

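/* Read the biography one blank-delimited word per observation */
data x;
 input text : $100. @@;
 obs+1;   /* running word position, used later to restore the original order */
cards4;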
Mr. Richard M. Cohen, MBA is the President of Richard M. Cohen Consultants Inc., since 1996 and serves as its Managing Principal. Mr. Cohen has been the Chief Financial Officer of CorMedix, Inc. since January 1, 2013 and serves as its Principal Accounting Officer. He serves as the Chief Executive Officer, Chief Financial Officer and Chief Accounting Officer of Websoft Systems, Inc. He serves as the Managing Director of Strauss Capital Partners LLC. Since 2002, he has been Managing Director of Encore/Novation. He served as an Interim Chief Executive Officer of CorMedix, Inc from 2011 to January 1, 2013 and Interim Chief Financial Officer from May 2, 2012 January 1, 2013. He served as Secretary of Dune Energy Inc. He served as the Chief Executive Officer, Chief Financial Officer and Principal Accounting Officer of Newtown Lane Marketing Incorporated. He served as the Chief Financial Officer of Cross Canyon Energy Corp. (formerly, ABC Funding Inc.) from April 2006 to January 2008 and also served as its Principal Accounting Officer. He served as the President of Pipeline Data, Inc. since January 2001 and served as its Treasurer. He served as the Chief Financial Officer, Principal Accounting Officer, Treasurer and Secretary of Pinpoint Recovery Solutions Corp. He served as the Chief Financial Officer of College Oak Investments Inc. since December 2005. Mr. Cohen served as Chief Financial Officer of Dune Energy Inc. from December 2003 to April 7, 2005 and Manager from April 7, 2005 to May 31, 2005. From 1993 to 1995. Mr. Cohen served as President of General Media Inc. from 1993 to 1995. He served as the Chief Financial Officer of Baseline Oil & Gas Corp., since December 27, 2005 and served as its Principal Accounting Officer. From 1988 to 1993, Mr. Cohen served as a Director of Investment Banking at Furman Selz Inc. In 1999, He served as the President of National Auto Credit, a publicly traded sub-prime auto finance company. From 1984 to 1992, Mr. 
Cohen served as an Investment Banker of Henry Ansbacher, Furman Selz, where he specialized in Mergers & Acquisitions, Public Equity Offerings, and Restructurings. From 1980 to 1983, he served as a Vice President of Corporate Development of Macmillan. He worked at Arthur Andersen & Co. from 1975 to 1977. He serves as the Executive Chairman of CorMedix, Inc. He has been the Chairman of Chord Advisors since 2012. Mr. Cohen has been a Director of CorMedix, Inc. since December 2009, Cross Media Marketing Corp. since October 1998, Helix Biomedix Inc. since December 14, 2005 and Pinpoint Recovery Solutions Corp. since March 2007. He has been a Director of China Filtration Technology, Inc. since June 2010 and China SLP Filtration Technology, Inc. since June 2010. He serves as a Director of Symposium Corp., Dey & Co., Redirect, Inc., Flaghouse Communications, Greg Manning Auctions Inc., Websoft Systems, Inc. and Interlink Inc. He served as a Director of Dune Energy Inc. from December 2003 to January 17, 2012, Ventrus Biosciences Inc. until November 10, 2010, Universal Travel Group from May 9, 2007 to June 23, 2008 and Direct Markets Holdings Corp. (formerly, Rodman & Renshaw Capital Group, Inc.) from August 2007 to August 21, 2012. He served as a Director of Immune Pharmaceuticals Ltd., and Newtown Lane Marketing Incorporated. He holds a Certified Public Accountant designation from the State of New York. Mr. Cohen received a B.S. cum laude from the Wharton School of Business at the University of Pennsylvania in 1973 and an M.B.A from Stanford University.
;;;;
run;
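/* Start a new group n right after each company suffix (Inc, Corp, LLC, ...). */
/* compress(..,,'ka') keeps letters only, so 'Inc.,' is compared as 'inc'.    */
data x;
 set x;
 if compress(lowcase(lag(text)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc') then n+1;
run;

/* Within each group, m counts the lead-in words (of, at, and, a) seen so far. */
data x;
 set x;
 by n;
 if first.n then m=0;
 if compress(lowcase(lag(text)),,'ka') in ('of' 'at' 'and' 'a') then m+1;
run;

/* Keep only the words that follow the last lead-in word of each group. */
proc sql;
 create table temp as
  select * from x group by n having m=max(m) order by obs;
quit;

/* Glue the remaining words of each group back into one string. */
data temp1;
 set temp;
 by n;
 length x $ 2000;
 retain x;
 x=catx(' ',x,text);
 if last.n then do; output; call missing(x); end;
 keep x;
run;

/* Drop any group without a company suffix (in practice the trailing text). */
data temp2;
 set temp1;
 id=prxparse('/\bincorporated|\bInc|\bcorporation|\bcorp|\bLLC/io');
 if prxmatch(id,x) then output;
 drop id;
run;

/* One row, with one column per extracted company. */
proc transpose data=temp2 out=want;
 var x;
run;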
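
The grouping step relies on `compress(lowcase(...),,'ka')` to strip punctuation before the comparison. The small check below is only a sketch with made-up sample words; it writes each word and its normalized key to the log, so you can see why `Inc.,` and `LLC.` still match the suffix list.

```sas
data _null_;
  length word $20;
  /* sample words are illustrative, not taken from a real run */
  do word = 'Inc.,', 'Corp.', 'LLC.', '(formerly,';
    key = compress(lowcase(word), , 'ka');   /* k = keep, a = alphabetic characters */
    put word= key=;
  end;
run;
```

The same normalization is why a word such as `(formerly,` is compared as `formerly` and therefore never matches the suffix or lead-in lists.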

Ksharp

Frequent Contributor
Posts: 96

Re: Extract relevant information from the text

Thanks, Ksharp, for the code. It works well. I would greatly appreciate it if you could tell me how to modify the code to accommodate the following:

(1) How can I capture a firm whose name is written like Procter & Gamble? That is, two consecutive words that each start with an upper-case letter, with the symbol "&" between them.

(2) Is it possible to capture a company's name if it is written like "Alliance Boots"? That is, two consecutive words each start with an upper-case letter, the first word is not preceded by a period (.), and the two words are not the first name and last name of the executive.

(3) I have around 200+ observations. Should I run the above code separately for each observation?

(4) Can you please tell me how I can retain the first name and the last name of the individual in the final output?

Thanks for all the help.

Solution
‎05-17-2013 09:16 PM
Super User
Posts: 9,681

Re: Extract relevant information from the text

(1) How can I capture a firm whose name is written like Procter & Gamble? That is, two consecutive words that each start with an upper-case letter, with the symbol "&" between them.

(2) Is it possible to capture a company's name if it is written like "Alliance Boots"? That is, two consecutive words each start with an upper-case letter, the first word is not preceded by a period (.), and the two words are not the first name and last name of the executive.

My code can capture them, as long as they end with 'Inc', 'Corp', and so on.
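
For names without such a suffix, one possible direction for case (1) is a separate regular-expression pass. The sketch below is an assumption-laden illustration, not part of the solution code: the sample sentence, the data set name `amp_names` and the pattern are all made up here. It uses `CALL PRXNEXT` to pull out every run of capitalized words joined by " & ".

```sas
data amp_names;
  length company $100;
  /* hypothetical sample sentence */
  text = 'He worked at Procter & Gamble and later at Arthur Andersen & Co. from 1975 to 1977.';
  /* one or more capitalized words, then " & ", then one capitalized word */
  re = prxparse('/\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* & [A-Z][A-Za-z]*\.?/');
  start = 1;
  stop  = length(text);
  call prxnext(re, start, stop, text, pos, len);
  do while (pos > 0);
    company = substr(text, pos, len);   /* matched text becomes one record */
    output;
    call prxnext(re, start, stop, text, pos, len);
  end;
  keep company;
run;
```

A pattern like this also matches phrases such as "Mergers & Acquisitions" in the sample biography, so the results would still need to be checked against a list of known non-company phrases. Case (2) is harder for the same reason: two capitalized words in a row are not necessarily a company.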

(3) I have around 200+ observations. Should I run the above code separately for each observation?

No, you don't need to. But that depends on what your data looks like.
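
As one possible layout, suppose the biographies sit in a data set with one row per person. The sketch below assumes a hypothetical data set `bios` with the made-up variables `id` and `bio`; it splits every biography into one word per observation while keeping the person's `id`, which is the same shape the word-level steps expect.

```sas
/* Hypothetical input: one row per person */
data bios;
  infile datalines truncover;
  input id bio $200.;
  datalines;
1 He serves as a Director of Dune Energy Inc. since 2003.
2 She worked at Websoft Systems, Inc. until 2010.
;
run;

/* One observation per word, with the person id carried along */
data words;
  set bios;
  length text $100;
  do obs = 1 to countw(bio, ', ');
    text = scan(bio, obs, ', ');   /* commas and blanks both act as delimiters */
    output;
  end;
  keep id obs text;
run;
```

With this layout the later grouping steps would presumably need `id` added to their BY statements and ORDER BY clause, and the counters reset at the start of each `id`, so that words from one person are never combined with words from the next.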

(4) Can you please tell me how I can retain the first name and the last name of the individual in the final output?

The code below keeps the name on every record:

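data x;
 infile cards dlm=', ';                 /* split on commas and blanks */
 if _n_ eq 1 then input name & $50. @;  /* & lets the name contain single blanks; it stops at the ', ' after Cohen */
 retain name;                           /* carry the name onto every word record */
 input text : $100. @@;
 obs+1;                                 /* running word position */
cards4;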
Mr. Richard M. Cohen, MBA is the President of Richard M. Cohen Consultants Inc., since 1996 and serves as its Managing Principal. Mr. Cohen has been the Chief Financial Officer of CorMedix, Inc. since January 1, 2013 and serves as its Principal Accounting Officer. He serves as the Chief Executive Officer, Chief Financial Officer and Chief Accounting Officer of Websoft Systems, Inc. He serves as the Managing Director of Strauss Capital Partners LLC. Since 2002, he has been Managing Director of Encore/Novation. He served as an Interim Chief Executive Officer of CorMedix, Inc from 2011 to January 1, 2013 and Interim Chief Financial Officer from May 2, 2012 January 1, 2013. He served as Secretary of Dune Energy Inc. He served as the Chief Executive Officer, Chief Financial Officer and Principal Accounting Officer of Newtown Lane Marketing Incorporated. He served as the Chief Financial Officer of Cross Canyon Energy Corp. (formerly, ABC Funding Inc.) from April 2006 to January 2008 and also served as its Principal Accounting Officer. He served as the President of Pipeline Data, Inc. since January 2001 and served as its Treasurer. He served as the Chief Financial Officer, Principal Accounting Officer, Treasurer and Secretary of Pinpoint Recovery Solutions Corp. He served as the Chief Financial Officer of College Oak Investments Inc. since December 2005. Mr. Cohen served as Chief Financial Officer of Dune Energy Inc. from December 2003 to April 7, 2005 and Manager from April 7, 2005 to May 31, 2005. From 1993 to 1995. Mr. Cohen served as President of General Media Inc. from 1993 to 1995. He served as the Chief Financial Officer of Baseline Oil & Gas Corp., since December 27, 2005 and served as its Principal Accounting Officer. From 1988 to 1993, Mr. Cohen served as a Director of Investment Banking at Furman Selz Inc. In 1999, He served as the President of National Auto Credit, a publicly traded sub-prime auto finance company. From 1984 to 1992, Mr. 
Cohen served as an Investment Banker of Henry Ansbacher, Furman Selz, where he specialized in Mergers & Acquisitions, Public Equity Offerings, and Restructurings. From 1980 to 1983, he served as a Vice President of Corporate Development of Macmillan. He worked at Arthur Andersen & Co. from 1975 to 1977. He serves as the Executive Chairman of CorMedix, Inc. He has been the Chairman of Chord Advisors since 2012. Mr. Cohen has been a Director of CorMedix, Inc. since December 2009, Cross Media Marketing Corp. since October 1998, Helix Biomedix Inc. since December 14, 2005 and Pinpoint Recovery Solutions Corp. since March 2007. He has been a Director of China Filtration Technology, Inc. since June 2010 and China SLP Filtration Technology, Inc. since June 2010. He serves as a Director of Symposium Corp., Dey & Co., Redirect, Inc., Flaghouse Communications, Greg Manning Auctions Inc., Websoft Systems, Inc. and Interlink Inc. He served as a Director of Dune Energy Inc. from December 2003 to January 17, 2012, Ventrus Biosciences Inc. until November 10, 2010, Universal Travel Group from May 9, 2007 to June 23, 2008 and Direct Markets Holdings Corp. (formerly, Rodman & Renshaw Capital Group, Inc.) from August 2007 to August 21, 2012. He served as a Director of Immune Pharmaceuticals Ltd., and Newtown Lane Marketing Incorporated. He holds a Certified Public Accountant designation from the State of New York. Mr. Cohen received a B.S. cum laude from the Wharton School of Business at the University of Pennsylvania in 1973 and an M.B.A from Stanford University.
;;;;
run;
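/* Start a new group n right after each company suffix (Inc, Corp, LLC, ...). */
data x;
 set x;
 if compress(lowcase(lag(text)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc') then n+1;
run;

/* m counts the lead-in words (of, at, and, a) seen so far. */
data x;
 set x;
 by n;
 if compress(lowcase(lag(text)),,'ka') in ('of' 'at' 'and' 'a') then m+1;
run;

/* Keep only the words that follow the last lead-in word of each group. */
proc sql;
 create table temp as
  select * from x group by n having m=max(m) order by obs;
quit;

/* Glue the remaining words of each group back into one string. */
data temp1;
 set temp;
 by n;
 length x $ 2000;
 retain x;
 x=catx(' ',x,text);
 if last.n then do; output; call missing(x); end;
 keep name x;
run;

/* Drop any group without a company suffix (in practice the trailing text). */
data temp2;
 set temp1;
 id=prxparse('/\bincorporated|\bInc|\bcorporation|\bcorp|\bLLC/io');
 if prxmatch(id,x) then output;
 drop id;
run;

/* One row per name, with one column per extracted company. */
proc transpose data=temp2 out=want(drop=_name_);
 by name;
 var x;
run;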


Ksharp

☑ This topic is SOLVED.
