Hi,
I would say I am very new to SAS...so please kindly help me. I am struggling with a very simple task. I want to be able to identify people with a bachelor's degree. Below is the sample of the data:
people | obtain_year | qualification |
Tom | 1990 | BA |
Tom | 2002 | master |
Tom | 2010 | MBA |
Alice | 1994Jan | bachelor's degree |
Alice | 2011 | EMBA |
Lou | 2005 | BS |
Lou | 2006Jun01 | MSc |
Lou | 2012 | master of business administration |
My desired outcome should create columns of indicators flagging who has bachelor, master, and MBA degrees, regardless of the uppercase and lowercase.
people | obtain_year | qualification | bachelor | master | MBA |
Tom | 1990 | BA | 1 | ||
Tom | 2002 | master | 1 | ||
Tom | 2010 | MBA | 1 | ||
Alice | 1994Jan | bachelor's degree | 1 | ||
Alice | 2011 | EMBA | |||
Lou | 2005 | BS | 1 | ||
Lou | 2006Jun01 | MSc | 1 | ||
Lou | 2012 | master of business administration | 1 |
I tried using 'findw' and 'index', but these functions will pick 'MBA' as well when I only want to include 'BA'. Bachelor degrees should be equal to 1 when the qualification contains the exact words: 'bachelor', 'BA', 'BS', or 'bachelor's degree'.
Thank you so much for your assistance in advance, as I really need to get this done soon in order for the next step of my research.
Please provide sample data in the form of a working DATA step, and the sas code you are using. You'll get better advice, and (likely) tested code.
By the way, using FINDW (find word) to detect the unigram 'BA' should not give a false positive searching a string containing 'MBA'.
Code not tested because there was no ready-made sample data but should work.
data want;
set have;
length Batchelor Master MBA $1;
if findw(qualification,'ba','it')>0 then Batchelor='1';
else if findw(qualification,'bachelor','it')>0 then Batchelor='1';
else if findw(qualification,'master','it')>0 then Master='1';
else if findw(qualification,'mba','it')>0 then MBA='1';
run;
Hi,
Thank you so much for your code. I managed to solve my problem using the following codes:
data education;
set org.na_dir_profile_education;
if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BSC')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BE')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BSE')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BENG')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BA')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B Acc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B Arch')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B.Acc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B.Math')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B.Proc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BAS')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BASc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BAppSc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BArch')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BBA')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BBM')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BBS')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BCA')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BCL')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BCom')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BComm')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BCompt')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BEc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BEcon')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BEd')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BFA')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BInf')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1;
run;
May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?
Thank you.
@Jarvin99 wrote:
....
May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?
There might be options to write something dynamic but if and what really is suitable for your data depends on what you have and your search terms.
You were talking about searching for WORD (not strings that can be part of a word) but looking into your code there might be some challenges what defines a word delimiter.
For example based on below it appears both a blank and a fullstop can't always serve as word delimiter.
To create something dynamic the code logic would need to "know" when it can't use the normal word delimiters and what can serve as delimiter.
if findw(upcase(qualification), 'B Acc')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B Arch')>=1 then bachelor = 1;
if findw(upcase(qualification), 'B.Acc')>=1 then bachelor = 1;
If you have a series of IF statements that are exclusive then use ELSE IF. There is really no need to execute all the IF's after an expression returns true.
None of above 3 expressions will ever return TRUE because if not explicitly specified via a parameter findw() will use blank and fullstop as word delimiters and though can never find a term that contains such a word delimiter. You also don't need to upcase the source variable if you use the findw() "i" modifier that instructs the function to search case insensitive.
characters in findw() let's you define explicitly what to use as word delimiters if you can't use the default.
I suggest that you read in the docu the details about the findw() function to fully understand what functionality is available and then investigate your search terms to identify the challenges.
Note that for FINDW() (and other WORD finding functions) it is perfectly fine to search for "words" that contain delimiters.
So
findw(string,'Word1 word2',' ','i')
Will find WORD1 followed by a single space and WORD2 that is either at the beginning or end of the string or between two spaces.
Perhaps:
data education; set org.na_dir_profile_education; if upcase(qualification) in ( 'BACHELOR' "BACHELOR'S DEGREE" 'BS' 'BSC' 'BE' 'BSE' 'BENG' 'BA' 'B ACC' 'B ARCH' 'B.ACC' 'B.MATH' 'B.PROC' 'BAS' 'BASC' 'BAPPSC' 'BARCH' 'BBA' 'BBM' 'BBS' 'BCA' 'BCL' 'BCOM' 'BCOMM' 'BCOMPT' 'BEC' 'BECON' 'BED' 'BFA' 'BINF' 'BINFTECH' 'BJ' ) then bachelor=1; run;
will reduce some of your coding since you apparently have WAY MORE values than your question stated.
Not to mention that several of your FINDW were looking for not upcased value in the search string.
Findw does have the i optional modifier to ignore case.
@Jarvin99 wrote:
Hi,
Thank you so much for your code. I managed to solve my problem using the following codes:
data education; set org.na_dir_profile_education; if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1; if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1; if findw(upcase(qualification), 'BS')>=1 then bachelor = 1; ... many lines deleted ...
if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1; if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1; run;May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?
In addition to the comments others have made about using ELSE IF construct's to avoid doing superfluous IF test's, you should consider using a _TEMPORARY array of the search terms as a code-saving device, as in:
data education (drop=i _:);
set org.na_dir_profile_education;
array _text_ba {10} $20 _temporary_
('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
'BAS','BASc','BAppSc')
bachelor=0;
do i=1 to dim(_text_ba) until (bachelor=1);
bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
end;
run;
The above ignores upper/lower case issues, which are easily addressed. More important is the issue of searching for two-word phrases ("bigrams" in this note. One-word phrases are unigrams). FINDW is not meant to find them. Here's a workaround, which divides your search terms into unigrams and bigrams:
data education (drop=i _:);
set org.na_dir_profile_education;
set have;
array _unigrams_ba {26} $20 _temporary_
('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
'BAS','BASc','BAppSc','BArch','BBA','BBM','BBS','BCA',
'BCL','BCom','BComm','BCompt','BEc','BEcon','BEd','BFA',
'BInf','BInfTech','BJ');
array _bigrams_ba {6} $20 _temporary_
("BACHELOR'S DEGREE",'B Acc','B Arch','B.Acc','B.Math','B.Proc');
/*Make everything upper-case, for finding purposes */
_upcase_qual=upcase(qualification);
if _n_=1 then do;
do i=1 to dim(_unigrams_ba); unigrams_ba{i}=upcase(_unigrams_ba{i}); end;
do i=1 to dim(_bigrams_ba); bigrams_ba{i}=upcase(_bigrams_ba{i}); end;
end;
bachelor=0;
do i=1 to dim(_unigrams_ba) until (bachelor=1);
bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
end;
if bachelor=0 then do i=1 to dim(_bigrams_ba) until (bachelor=1);
_w1=findw(_upcase_qual,trim(scan(_bigrams_ba{i},1)),' .','e');
_w2=findw(_upcase_qual,trim(scan(_bigrams_ba{i},2)),' .','e');
bachelor=(_w2=_w1+1) and (_w1>0);
end;
run;
The "trick" here in dealing with bigrams is to use a feature of FINDW (the 'e' as the 4th parameter of FINDW) that returns the word-sequence number rather than the character position of a search-word inside a string-of-words. The benefit is that if you are searching for "Bachelor's degree", you want to know if the word number of "degree" is one greater than the word number of "Bachelor's". Of course, that is not bullet proof, since it doesn't protect against any of the words appearing more than once, masking detection of the proper sequence. Code can be written to avoid this, but this code is a little more self-evident. The third argument of FINDW ( ' .') tells the function that only those two characters are word delimiters.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.