BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
shalmali
Calcite | Level 5

Hi All,

I have a code (thanks to Ksharp) that helps me to extract company names from biographies of each individual. The problem however is that I have 20,000 biographies and typing biography of each individual ( as done in the code below) is very time consuming. I have the biographies of all the individuals in an excel file with the following information:

Biography  full_name           Age

text           Frank Smith          50

text           Joe Cohen          65

Can someone please tell me how to modify the code below to get the output for all the 20,000 observations without typing the biographies for each of them?  Please see the attached file with the sample data.

data x;

infile cards dlm=', ';

if _n_ eq 1 then input name & $50. @;

retain name;

input text : $100. @@;

obs+1;

cards4;

  1. Mr. Richard M. Cohen, MBA is the President of Richard M. Cohen Consultants Inc., since 1996 and serves as its Managing Principal. Mr. Cohen has been the Chief Financial Officer of CorMedix, Inc. since January 1, 2013 and serves as its Principal Accounting Officer. He serves as the Chief Executive Officer, Chief Financial Officer and Chief Accounting Officer of Websoft Systems, Inc. He serves as the Managing Director of Strauss Capital Partners LLC. Since 2002, he has been Managing Director of Encore/Novation. He served as an Interim Chief Executive Officer of CorMedix, Inc from 2011 to January 1, 2013 and Interim Chief Financial Officer from May 2, 2012 January 1, 2013. He served as Secretary of Dune Energy Inc. He served as the Chief Executive Officer, Chief Financial Officer and Principal Accounting Officer of Newtown Lane Marketing Incorporated. He served as the Chief Financial Officer of Cross Canyon Energy Corp. (formerly, ABC Funding Inc.) from April 2006 to January 2008 and also served as its Principal Accounting Officer. He served as the President of Pipeline Data, Inc. since January 2001 and served as its Treasurer. He served as the Chief Financial Officer, Principal Accounting Officer, Treasurer and Secretary of Pinpoint Recovery Solutions Corp. He served as the Chief Financial Officer of College Oak Investments Inc. since December 2005. Mr. Cohen served as Chief Financial Officer of Dune Energy Inc. from December 2003 to April 7, 2005 and Manager from April 7, 2005 to May 31, 2005. From 1993 to 1995. Mr. Cohen served as President of General Media Inc. from 1993 to 1995. He served as the Chief Financial Officer of Baseline Oil & Gas Corp., since December 27, 2005 and served as its Principal Accounting Officer. From 1988 to 1993, Mr. Cohen served as a Director of Investment Banking at Furman Selz Inc. In 1999, He served as the President of National Auto Credit, a publicly traded sub-prime auto finance company. From 1984 to 1992, Mr. Cohen served as an Investment Banker of Henry Ansbacher, Furman Selz, where he specialized in Mergers & Acquisitions, Public Equity Offerings, and Restructurings. From 1980 to 1983, he served as a Vice President of Corporate Development of Macmillan. He worked at Arthur Andersen & Co. from 1975 to 1977. He serves as the Executive Chairman of CorMedix, Inc. He has been the Chairman of Chord Advisors since 2012. Mr. Cohen has been a Director of CorMedix, Inc. since December 2009, Cross Media Marketing Corp. since October 1998, Helix Biomedix Inc. since December 14, 2005 and Pinpoint Recovery Solutions Corp. since March 2007. He has been a Director of China Filtration Technology, Inc. since June 2010 and China SLP Filtration Technology, Inc. since June 2010. He serves as a Director of Symposium Corp., Dey & Co., Redirect, Inc., Flaghouse Communications, Greg Manning Auctions Inc., Websoft Systems, Inc. and Interlink Inc. He served as a Director of Dune Energy Inc. from December 2003 to January 17, 2012, Ventrus Biosciences Inc. until November 10, 2010, Universal Travel Group from May 9, 2007 to June 23, 2008 and Direct Markets Holdings Corp. (formerly, Rodman & Renshaw Capital Group, Inc.) from August 2007 to August 21, 2012. He served as a Director of Immune Pharmaceuticals Ltd., and Newtown Lane Marketing Incorporated. He holds a Certified Public Accountant designation from the State of New York. Mr. Cohen received a B.S. cum laude from the Wharton School of Business at the University of Pennsylvania in 1973 and an M.B.A from Stanford University.

;;;;

run;

data x;

set x;

if compress(lowcase(lag(text)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc')  then n+1;

run;

data x;

set x;

by n;

if   compress(lowcase(lag(text)),,'ka') in ('of' 'at' 'and' 'a') then m+1;

run;

proc sql;

create table temp as

  select * from x group by n having m=max(m) order by obs;

quit;

data temp1;

set temp;

by n;

length x $ 2000;

retain x;

x=catx(' ',x,text);

if last.n then do; output;call missing(x);end;

keep name x;

run;

data temp2;

set temp1;

id=prxparse('/\bincorporated|\bInc|\bcorporation|\bcorp|\bLLC/io');

if prxmatch(id,x) then output;

drop id;

run;

proc transpose data=temp2 out=want(drop=_name_);

by name;

var x;

run;

Thank you for your time.

1 ACCEPTED SOLUTION

Accepted Solutions
RichardinOz
Quartz | Level 8

You have most of the stuff you need already.  You don't need the first datastep with the cards expression because you have imported the data.  So go from

proc import out=bios;

datafile= "C:\destop\june1.xls"

dbms=xls replace;

getnames=yes;

run;

/* to, replacing text with bios - not a good idea to give your table the same name as a column in it */

data x;

set bios;

if compress(lowcase(lag(bios)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc')  then n+1;

run;

/* etc */

/* Skipping the first step means the value obs will not be created but I don't think you need it because your text is all in one line */

proc sql;

create table temp as

  select * from x group by n having m=max(m) /* order by obs */;

quit;

/* etc */

If you encounter any errors, please post the relevant part of the log, including the error message.

BTW I would suggest ordering your Excel file so that the longest bios comes first (create an extra column with the lenth and sort descending by it).  That should ensure no subsequent bios gets truncated when imported into SAS.

Richard

View solution in original post

8 REPLIES 8
RichardinOz
Quartz | Level 8

KSharp's solution assumes that each biography is split into lines no longer than 100 characters.  Is that the case with your Excel data?  If it is, you can adapt the above code to use an infile statement to read data directly from Excel (if you have an up to date version with the "SAS Access to PC File Formats" product); or you can export from Excel as a CSV file and import into SAS.  If your linesize is longer you can adapt the informat.  In the worst case you can cut and paste the bio info into the space between the cards4  and the ;;;; expressions in KSharp's code - no need to retype!

You need to tell us more exactly how your data is structured in Excel and whether you have the SAS Access product mentioned, on the same PC as your Excel instance.

Richard

shalmali
Calcite | Level 5

Dear Richard,

Thanks for your reply. Some biographies are short with few sentences and characters while others are long. I imported the excel file into SAS using the following function:

proc import out=bios;

datafile= "C:\destop\june1.xls"

dbms=xls replace;

getnames=yes;

run;

After the import my work.bios has three columns: executive name, age, and bios. I want to retain the name and age and extract the company names from the bios. After importing a file with few selected observations, I checked the column attributes of the bios. It says the length of this column is 2791 and type is character.

I do not know what I need to do next (after proc import) to get the desired output without having to copy and paste the bios beetween cards4 and ;;;;expression. Also is it possible to have an output without restricting it to 100 characters?

I am currently using SAS 9.3. I dont think it is the lastest version with "SAS Access to PC file format".

Thank you once again,

shalmali

RichardinOz
Quartz | Level 8

You have most of the stuff you need already.  You don't need the first datastep with the cards expression because you have imported the data.  So go from

proc import out=bios;

datafile= "C:\destop\june1.xls"

dbms=xls replace;

getnames=yes;

run;

/* to, replacing text with bios - not a good idea to give your table the same name as a column in it */

data x;

set bios;

if compress(lowcase(lag(bios)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc')  then n+1;

run;

/* etc */

/* Skipping the first step means the value obs will not be created but I don't think you need it because your text is all in one line */

proc sql;

create table temp as

  select * from x group by n having m=max(m) /* order by obs */;

quit;

/* etc */

If you encounter any errors, please post the relevant part of the log, including the error message.

BTW I would suggest ordering your Excel file so that the longest bios comes first (create an extra column with the lenth and sort descending by it).  That should ensure no subsequent bios gets truncated when imported into SAS.

Richard

shalmali
Calcite | Level 5

Dear Richard,

Thanks for providing the code. The problem is that "n" and the "m" for all the bios in Data X (after step 3) are zero. In other words, the output of Data X looks as follows:

Name     Age     Bios       n     m

Mr.A     53          xxxx     0     0

Mr.B     60          xxxx     0     0

As a result after running the following code, I am not able to pick the company names from the bios.

proc import out=bios

datafile= "C:\Desktop\june1.xls"

dbms=xls replace;

getnames=yes;

run;

data x;

set bios;

if compress(lowcase(lag(biographies)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc'then n+1;

run;

data x;

set x;

by n;

if first.n then m=0;

if   compress(lowcase(lag(biographies)),,'ka') in ('of' 'at' 'and' 'a') then m+1;

run;

proc sql;

create table temp as

  select * from x group by n having m=max(m) /* order by obs */;

quit;

data temp1;

set temp;

by n;

length x $ 2000;

retain x;

x=catx(' ',x,biographies);

if last.n then do; output;call missing(x);end;

keep x;

run;

data temp2;

set temp1;

id=prxparse('/\bincorporated|\bInc|\bcorporation|\bcorp|\bLLC|\bco|\bcompany/io');

if prxmatch(id,x) then output;

drop id;

run;

proc transpose data=temp2 out=want;

var x;

run;

I would grately appreciate if you help me figure it out how to get the desired output.

Thank you for you time.

Shalmali.

RichardinOz
Quartz | Level 8

Your version of the data temp1 step only keeps x where the original also kept name.  Probably you removed name because your column is called something else, perhaps executive_name.  You need to include it and age in the keep statement.

data temp1;

set temp;

by n;

length x $ 2000;

retain x;

x=catx(' ',x,biographies);

if last.n then do; output;call missing(x);end;

keep executive_name age x;

run;

Similarly, in the transpose at the end you need the by statement to include executive_name age.  You may need to sort first.

Proc Sort

    data = temp2 ;

    by executive_name age ;

run;

proc transpose data=temp2 out=want(drop=_name_);

by executive_name age ;

var x;

run;

For any other problems with the code you may have to go back to KSharp.

Richard

RichardinOz
Quartz | Level 8

It's 6 am here and I have missed one of your questions, why m and n are both zero.  Your  version of the code

     if compress(lowcase(lag(biographies)),,'ka')

references the column biographies but your dump still references it as bios.  Did you change the name of the column in Excel?  You need to use the same column name that the data is imported under.

Richard

shalmali
Calcite | Level 5

Thanks Richard for helping me with the code. I greatly appreciate it.

RichardinOz
Quartz | Level 8

You are too kind

Literally

KSharp did the hard work on this one

Richard

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1142 views
  • 6 likes
  • 2 in conversation