Hi All,
I have a code (thanks to Ksharp) that helps me to extract company names from biographies of each individual. The problem however is that I have 20,000 biographies and typing biography of each individual ( as done in the code below) is very time consuming. I have the biographies of all the individuals in an excel file with the following information:
Biography full_name Age
text Frank Smith 50
text Joe Cohen 65
Can someone please tell me how to modify the code below to get the output for all the 20,000 observations without typing the biographies for each of them? Please see the attached file with the sample data.
data x;
infile cards dlm=', ';
if _n_ eq 1 then input name & $50. @;
retain name;
input text : $100. @@;
obs+1;
cards4;
;;;;
run;
data x;
set x;
if compress(lowcase(lag(text)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc') then n+1;
run;
data x;
set x;
by n;
if compress(lowcase(lag(text)),,'ka') in ('of' 'at' 'and' 'a') then m+1;
run;
proc sql;
create table temp as
select * from x group by n having m=max(m) order by obs;
quit;
data temp1;
set temp;
by n;
length x $ 2000;
retain x;
x=catx(' ',x,text);
if last.n then do; output;call missing(x);end;
keep name x;
run;
data temp2;
set temp1;
id=prxparse('/\bincorporated|\bInc|\bcorporation|\bcorp|\bLLC/io');
if prxmatch(id,x) then output;
drop id;
run;
proc transpose data=temp2 out=want(drop=_name_);
by name;
var x;
run;
Thank you for your time.
You have most of the stuff you need already. You don't need the first datastep with the cards expression because you have imported the data. So go from
proc import out=bios;
datafile= "C:\destop\june1.xls"
dbms=xls replace;
getnames=yes;
run;
/* to, replacing text with bios - not a good idea to give your table the same name as a column in it */
data x;
set bios;
if compress(lowcase(lag(bios)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc') then n+1;
run;
/* etc */
/* Skipping the first step means the value obs will not be created but I don't think you need it because your text is all in one line */
proc sql;
create table temp as
select * from x group by n having m=max(m) /* order by obs */;
quit;
/* etc */
If you encounter any errors, please post the relevant part of the log, including the error message.
BTW I would suggest ordering your Excel file so that the longest bios comes first (create an extra column with the lenth and sort descending by it). That should ensure no subsequent bios gets truncated when imported into SAS.
Richard
KSharp's solution assumes that each biography is split into lines no longer than 100 characters. Is that the case with your Excel data? If it is, you can adapt the above code to use an infile statement to read data directly from Excel (if you have an up to date version with the "SAS Access to PC File Formats" product); or you can export from Excel as a CSV file and import into SAS. If your linesize is longer you can adapt the informat. In the worst case you can cut and paste the bio info into the space between the cards4 and the ;;;; expressions in KSharp's code - no need to retype!
You need to tell us more exactly how your data is structured in Excel and whether you have the SAS Access product mentioned, on the same PC as your Excel instance.
Richard
Dear Richard,
Thanks for your reply. Some biographies are short with few sentences and characters while others are long. I imported the excel file into SAS using the following function:
proc import out=bios;
datafile= "C:\destop\june1.xls"
dbms=xls replace;
getnames=yes;
run;
After the import my work.bios has three columns: executive name, age, and bios. I want to retain the name and age and extract the company names from the bios. After importing a file with few selected observations, I checked the column attributes of the bios. It says the length of this column is 2791 and type is character.
I do not know what I need to do next (after proc import) to get the desired output without having to copy and paste the bios beetween cards4 and ;;;;expression. Also is it possible to have an output without restricting it to 100 characters?
I am currently using SAS 9.3. I dont think it is the lastest version with "SAS Access to PC file format".
Thank you once again,
shalmali
You have most of the stuff you need already. You don't need the first datastep with the cards expression because you have imported the data. So go from
proc import out=bios;
datafile= "C:\destop\june1.xls"
dbms=xls replace;
getnames=yes;
run;
/* to, replacing text with bios - not a good idea to give your table the same name as a column in it */
data x;
set bios;
if compress(lowcase(lag(bios)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc') then n+1;
run;
/* etc */
/* Skipping the first step means the value obs will not be created but I don't think you need it because your text is all in one line */
proc sql;
create table temp as
select * from x group by n having m=max(m) /* order by obs */;
quit;
/* etc */
If you encounter any errors, please post the relevant part of the log, including the error message.
BTW I would suggest ordering your Excel file so that the longest bios comes first (create an extra column with the lenth and sort descending by it). That should ensure no subsequent bios gets truncated when imported into SAS.
Richard
Dear Richard,
Thanks for providing the code. The problem is that "n" and the "m" for all the bios in Data X (after step 3) are zero. In other words, the output of Data X looks as follows:
Name Age Bios n m
Mr.A 53 xxxx 0 0
Mr.B 60 xxxx 0 0
As a result after running the following code, I am not able to pick the company names from the bios.
proc import out=bios
datafile= "C:\Desktop\june1.xls"
dbms=xls replace;
getnames=yes;
run;
data x;
set bios;
if compress(lowcase(lag(biographies)),,'ka') in ('incorporated' 'inc' 'corporation' 'corp' 'llc') then n+1;
run;
data x;
set x;
by n;
if first.n then m=0;
if compress(lowcase(lag(biographies)),,'ka') in ('of' 'at' 'and' 'a') then m+1;
run;
proc sql;
create table temp as
select * from x group by n having m=max(m) /* order by obs */;
quit;
data temp1;
set temp;
by n;
length x $ 2000;
retain x;
x=catx(' ',x,biographies);
if last.n then do; output;call missing(x);end;
keep x;
run;
data temp2;
set temp1;
id=prxparse('/\bincorporated|\bInc|\bcorporation|\bcorp|\bLLC|\bco|\bcompany/io');
if prxmatch(id,x) then output;
drop id;
run;
proc transpose data=temp2 out=want;
var x;
run;
I would grately appreciate if you help me figure it out how to get the desired output.
Thank you for you time.
Shalmali.
Your version of the data temp1 step only keeps x where the original also kept name. Probably you removed name because your column is called something else, perhaps executive_name. You need to include it and age in the keep statement.
data temp1;
set temp;
by n;
length x $ 2000;
retain x;
x=catx(' ',x,biographies);
if last.n then do; output;call missing(x);end;
keep executive_name age x;
run;
Similarly, in the transpose at the end you need the by statement to include executive_name age. You may need to sort first.
Proc Sort
data = temp2 ;
by executive_name age ;
run;
proc transpose data=temp2 out=want(drop=_name_);
by executive_name age ;
var x;
run;
For any other problems with the code you may have to go back to KSharp.
Richard
It's 6 am here and I have missed one of your questions, why m and n are both zero. Your version of the code
if compress(lowcase(lag(biographies)),,'ka')
references the column biographies but your dump still references it as bios. Did you change the name of the column in Excel? You need to use the same column name that the data is imported under.
Richard
Thanks Richard for helping me with the code. I greatly appreciate it.
You are too kind
Literally
KSharp did the hard work on this one
Richard
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.