I have an example dataset as follows:
DATA have;
input InvestorID Type & $char1000.;
length Type $1000;
DATALINES;
1001 Accelerator/Incubator
1002 Accelerator/Incubator, Angel (individual), Angel Group, Asset Manager, Business Development Company, Corporate Development, Corporate Venture Capital, Corporation, Family Office, Fund of Funds, Fundless Sponsor, Government, Growth/Expansion, Hedge Fund, Holding Company, Impact Investing, Infrastructure, Investment Bank, Leasing, Lender/Debt Provider, Limited Partner, Merchant Banking Firm, Mezzanine, Mutual Fund, Not-For-Profit Venture Capital, Other Private Equity, PE-Backed Company, PE/Buyout, Real Estate, SBIC, Secondary Buyer, Sovereign Wealth Fund, University, VC-Backed Company, Venture Capital
1003 Accelerator/Incubator, Angel (individual), Business Development Company, Venture Capital
1004 Corporate Venture Capital, Fund of Funds
;
RUN;
What I want to do is produce the following dataset:
DATA want;
input InvestorID Type & $char1000. CVC VC;
length Type $1000;
DATALINES;
1001 Accelerator/Incubator 0 0
1002 Accelerator/Incubator, Angel (individual), Angel Group, Asset Manager, Business Development Company, Corporate Development, Corporate Venture Capital, Corporation, Family Office, Fund of Funds, Fundless Sponsor, Government, Growth/Expansion, Hedge Fund, Holding Company, Impact Investing, Infrastructure, Investment Bank, Leasing, Lender/Debt Provider, Limited Partner, Merchant Banking Firm, Mezzanine, Mutual Fund, Not-For-Profit Venture Capital, Other Private Equity, PE-Backed Company, PE/Buyout, Real Estate, SBIC, Secondary Buyer, Sovereign Wealth Fund, University, VC-Backed Company, Venture Capital 1 1
1003 Accelerator/Incubator, Angel (individual), Business Development Company, Venture Capital 0 1
1004 Corporate Venture Capital, Fund of Funds 1 0
;
RUN;
The CVC variable takes on a value of 1 if in Type, the word Corporate Venture Capital appears, otherwise it is 0. Likewise the VC variable takes on a value of 1 if in Type, the word Venture Capital appears, otherwise it is 0. These words Corporate Venture Capital and Venture Capital are always spelt with the same capitalization.
You have to tell INDEXW() what characters constituent end of a word. Or just use INDEX() instead.
DATA have;
input InvestorID Type & $1000.;
CVC = 0 < indexw(Type,'Corporate Venture Capital',' ,.()');
VC= 0 < indexw(Type,'Venture Capital',' ,.()');
DATALINES;
1001 Accelerator/Incubator
1002 Accelerator/Incubator, Angel (individual), Angel Group, Asset Manager, Business Development Company, Corporate Development, Corporate Venture Capital, Corporation, Family Office, Fund of Funds, Fundless Sponsor, Government, Growth/Expansion, Hedge Fund, Holding Company, Impact Investing, Infrastructure, Investment Bank, Leasing, Lender/Debt Provider, Limited Partner, Merchant Banking Firm, Mezzanine, Mutual Fund, Not-For-Profit Venture Capital, Other Private Equity, PE-Backed Company, PE/Buyout, Real Estate, SBIC, Secondary Buyer, Sovereign Wealth Fund, University, VC-Backed Company, Venture Capital
1003 Accelerator/Incubator, Angel (individual), Business Development Company, Venture Capital
1004 Corporate Venture Capital, Fund of Funds
;
proc print;
run;
I ask for clarification. In several places in your text, you indicate you are looking for a "word", but then the search strings are made up of several words, such as "Corporate Venture Capital". Are we searching for one of these words, or do we have to find all of "Corporate Venture Capital" to get a match?
Sorry, I mean the entire string "Corporate Venture Capital" to get a match. Likewise for the entire string "Venture Capital".
If you just want to find the string, even if it is part of a longer word, then use INDEX() or FIND(). If you want to only find it when it is a stand only word (or phrase in this case) then use INDEXW() or FINDW().
All of those functions return the location the value was found. Which you could treat as TRUE or FALSE since SAS considers any number other than zero or missing as TRUE.
Or you could just add a simple comparison operator to force it to be a true 0/1 boolean value.
Another advantage of the FIND() and FINDW() function is that they accept modifiers to allow you to ignore case and/or trim leading/trailing delimiters.
CVC = 0 < indexw(Type,'Corporate Venture Capital');
CV= 0 < indexw(Type,'Venture Capital');
Thank you. However, when I try the following:
data want;
set have;
CVC = 0 < indexw(Type,'Corporate Venture Capital');
VC= 0 < indexw(Type,'Venture Capital');
run;
It does not seem to be picking up the "Corporate Venture Capital", for example, CVC is 0 for both investorID=1002 and 1004, when it should be 1.
Does below work for you?
DATA have;
input InvestorID Type & $char1000. CVC_want VC_want;
length Type $1000;
DATALINES;
1001 Accelerator/Incubator 0 0
1002 Accelerator/Incubator, Angel (individual), Angel Group, Asset Manager, Business Development Company, Corporate Development, Corporate Venture Capital, Corporation, Family Office, Fund of Funds, Fundless Sponsor, Government, Growth/Expansion, Hedge Fund, Holding Company, Impact Investing, Infrastructure, Investment Bank, Leasing, Lender/Debt Provider, Limited Partner, Merchant Banking Firm, Mezzanine, Mutual Fund, Not-For-Profit Venture Capital, Other Private Equity, PE-Backed Company, PE/Buyout, Real Estate, SBIC, Secondary Buyer, Sovereign Wealth Fund, University, VC-Backed Company, Venture Capital 1 1
1003 Accelerator/Incubator, Angel (individual), Business Development Company, Venture Capital 0 1
1004 Corporate Venture Capital, Fund of Funds 1 0
;
data want;
set have;
cvc= find(type,'Corporate Venture Capital','i')>0;
cv= prxmatch('/(?<!Corporate )Venture Capital/i',strip(type))>0;
run;
proc print data=want(drop=type);
run;
Thanks that works! I also tried the following, which works as well:
type_nc = compress(type);
CVC = 0 < indexw(type_nc,'CorporateVentureCapital',", ");
VC = 0 < indexw(type_nc,'VentureCapital',", ");
You have to tell INDEXW() what characters constituent end of a word. Or just use INDEX() instead.
DATA have;
input InvestorID Type & $1000.;
CVC = 0 < indexw(Type,'Corporate Venture Capital',' ,.()');
VC= 0 < indexw(Type,'Venture Capital',' ,.()');
DATALINES;
1001 Accelerator/Incubator
1002 Accelerator/Incubator, Angel (individual), Angel Group, Asset Manager, Business Development Company, Corporate Development, Corporate Venture Capital, Corporation, Family Office, Fund of Funds, Fundless Sponsor, Government, Growth/Expansion, Hedge Fund, Holding Company, Impact Investing, Infrastructure, Investment Bank, Leasing, Lender/Debt Provider, Limited Partner, Merchant Banking Firm, Mezzanine, Mutual Fund, Not-For-Profit Venture Capital, Other Private Equity, PE-Backed Company, PE/Buyout, Real Estate, SBIC, Secondary Buyer, Sovereign Wealth Fund, University, VC-Backed Company, Venture Capital
1003 Accelerator/Incubator, Angel (individual), Business Development Company, Venture Capital
1004 Corporate Venture Capital, Fund of Funds
;
proc print;
run;
Whichever method you use, searching through a 1000-character string can be time-consuming. If speed is an issue, you can probably speed things up by taking advantage of the fact that "Venture Capital" is a subset of "Corporate Venture Capital". Consider this structure:
VC = (whatever method you choose);
if VC then CVC = (whatever method you choose);
else CVC = 0;
This avoids searching for "Corporate Venture Capital" in cases where it definitely won't exist.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.