BookmarkSubscribeRSS Feed
Jarvin99
Obsidian | Level 7

Hi,

I would say I am very new to SAS...so please kindly help me. I am struggling with a very simple task. I want to be able to identify people with a bachelor's degree. Below is the sample of the data:

people obtain_year qualification
Tom 1990 BA
Tom 2002 master
Tom 2010 MBA
Alice 1994Jan bachelor's degree
Alice 2011 EMBA
Lou 2005 BS
Lou 2006Jun01 MSc
Lou 2012 master of business administration

 

My desired outcome should create columns of indicators flagging who has bachelor, master, and MBA degrees, regardless of the uppercase and lowercase.

people obtain_year qualification bachelor master MBA
Tom 1990 BA 1    
Tom 2002 master   1  
Tom 2010 MBA     1
Alice 1994Jan bachelor's degree 1    
Alice 2011 EMBA      
Lou 2005 BS 1    
Lou 2006Jun01 MSc   1  
Lou 2012 master of business administration     1

 

I tried using 'findw' and 'index', but these functions will pick 'MBA' as well when I only want to include 'BA'. Bachelor degrees should be equal to 1 when the qualification contains the exact words: 'bachelor', 'BA', 'BS', or 'bachelor's degree'.

 

Thank you so much for your assistance in advance, as I really need to get this done soon in order for the next step of my research.

7 REPLIES 7
mkeintz
PROC Star

Please provide sample data in the form of a working DATA step, and the sas code you are using.   You'll get better advice, and (likely) tested code.

 

By the way, using FINDW (find word) to detect the unigram 'BA' should not give a false positive searching a string containing 'MBA'.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Patrick
Opal | Level 21

Code not tested because there was no ready-made sample data but should work.

data want;
  set have;
  length Batchelor Master MBA $1;
  if findw(qualification,'ba','it')>0             then Batchelor='1';
  else if findw(qualification,'bachelor','it')>0  then Batchelor='1';
  else if findw(qualification,'master','it')>0    then Master='1';
  else if findw(qualification,'mba','it')>0       then MBA='1';
run;

 

 

Jarvin99
Obsidian | Level 7

Hi,

Thank you so much for your code. I managed to solve my problem using the following codes:

data education; 
	set org.na_dir_profile_education;
	if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
	if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BSC')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BE')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BSE')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BENG')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BA')>=1 then bachelor = 1;

	if findw(upcase(qualification), 'B Acc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B Arch')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Acc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Math')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Proc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BAS')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BASc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BAppSc')>=1 then bachelor = 1;

	if findw(upcase(qualification), 'BArch')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BBA')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BBM')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BBS')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BCA')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BCL')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BCom')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BComm')>=1 then bachelor = 1;

	if findw(upcase(qualification), 'BCompt')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BEc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BEcon')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BEd')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BFA')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BInf')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1;
run;

May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?

 

Thank you.

Patrick
Opal | Level 21

 


@Jarvin99 wrote:

....
May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?


There might be options to write something dynamic but if and what really is suitable for your data depends on what you have and your search terms.

You were talking about searching for WORD (not strings that can be part of a word) but looking into your code there might be some challenges what defines a word delimiter.

 

For example based on below it appears both a blank and a fullstop can't always serve as word delimiter.

To create something dynamic the code logic would need to "know" when it can't use the normal word delimiters and what can serve as delimiter.

	if findw(upcase(qualification), 'B Acc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B Arch')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Acc')>=1 then bachelor = 1;

If you have a series of IF statements that are exclusive then use ELSE IF. There is really no need to execute all the IF's after an expression returns true.

 

None of above 3 expressions will ever return TRUE because if not explicitly specified via a parameter findw() will use blank and fullstop as word delimiters and though can never find a term that contains such a word delimiter. You also don't need to upcase the source variable if you use the findw() "i" modifier that instructs the function to search case insensitive.

 

Patrick_0-1702460650097.png

characters in findw() let's you define explicitly what to use as word delimiters if you can't use the default.

 

I suggest that you read in the docu the details about the findw() function to fully understand what functionality is available and then investigate your search terms to identify the challenges.

 

 

Tom
Super User Tom
Super User

Note that for FINDW() (and other WORD finding functions) it is perfectly fine to search for "words" that contain delimiters.

 

So

findw(string,'Word1 word2',' ','i')

Will find WORD1 followed by a single space and WORD2 that is either at the beginning or end of the string or between two spaces.

ballardw
Super User

Perhaps:

data education; 
 set org.na_dir_profile_education;
 if upcase(qualification) in ( 'BACHELOR'
 "BACHELOR'S DEGREE" 'BS' 'BSC' 'BE' 'BSE' 'BENG' 'BA'
 'B ACC' 'B ARCH' 'B.ACC' 'B.MATH' 'B.PROC' 'BAS' 'BASC' 'BAPPSC' 
 'BARCH' 'BBA' 'BBM' 'BBS' 'BCA' 'BCL' 'BCOM' 'BCOMM' 
 'BCOMPT' 'BEC' 'BECON' 'BED' 'BFA' 'BINF' 'BINFTECH' 'BJ' 
 ) then bachelor=1;
run;

will reduce some of your coding since you apparently have WAY MORE values than your question stated.

Not to mention that several of your FINDW were looking for not upcased value in the search string.

 

Findw does have the i optional modifier to ignore case.

mkeintz
PROC Star

@Jarvin99  wrote: 

Hi,

Thank you so much for your code. I managed to solve my problem using the following codes:

data education; 
	set org.na_dir_profile_education;
	if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
	if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
    ... many lines deleted ...
if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1; if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1; run;

May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?

 

In addition to the comments others have made about using ELSE IF construct's to avoid doing superfluous IF test's, you should consider using a _TEMPORARY array of the search terms as a code-saving device, as in:

 

 

data education (drop=i _:); 
	set org.na_dir_profile_education;

     array _text_ba {10} $20 _temporary_
       ('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
        'BAS','BASc','BAppSc')

     bachelor=0;
     do i=1 to dim(_text_ba) until (bachelor=1);
       bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
     end;
run;

 

 

The above ignores upper/lower case issues, which are easily addressed.  More important is the issue of searching for two-word phrases ("bigrams" in this note.  One-word phrases are unigrams). FINDW is not meant to find them.  Here's a workaround, which divides your search terms into unigrams and bigrams:

 

 

data education (drop=i _:); 
	set org.na_dir_profile_education;
	set have;

     array _unigrams_ba {26} $20 _temporary_
       ('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
        'BAS','BASc','BAppSc','BArch','BBA','BBM','BBS','BCA',
        'BCL','BCom','BComm','BCompt','BEc','BEcon','BEd','BFA',
        'BInf','BInfTech','BJ');

     array _bigrams_ba {6} $20 _temporary_
       ("BACHELOR'S DEGREE",'B Acc','B Arch','B.Acc','B.Math','B.Proc');

     /*Make everything upper-case, for finding purposes */
     _upcase_qual=upcase(qualification);
     if _n_=1 then do;
       do i=1 to dim(_unigrams_ba);  unigrams_ba{i}=upcase(_unigrams_ba{i}); end;
       do i=1 to dim(_bigrams_ba);   bigrams_ba{i}=upcase(_bigrams_ba{i}); end;
     end;

     bachelor=0;
     do i=1 to dim(_unigrams_ba) until (bachelor=1);
       bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
     end;

     if bachelor=0 then do i=1 to dim(_bigrams_ba) until (bachelor=1);
       _w1=findw(_upcase_qual,trim(scan(_bigrams_ba{i},1)),' .','e');
       _w2=findw(_upcase_qual,trim(scan(_bigrams_ba{i},2)),' .','e');
       bachelor=(_w2=_w1+1) and (_w1>0);          
     end;
run;

 

The "trick" here in dealing with bigrams is to use a feature of FINDW (the 'e' as the 4th parameter of FINDW) that returns the word-sequence number rather than the character position of a search-word inside a string-of-words.  The benefit is that if you are searching for "Bachelor's degree", you want to know if the word number of "degree" is one greater than the word number of "Bachelor's".   Of course, that is not bullet proof, since it doesn't protect against any of the words appearing more than once, masking detection of the proper sequence.  Code can be written to avoid this, but this code is a little more self-evident.     The third argument of FINDW  ( ' .') tells the function that only those two characters are word delimiters.    

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 908 views
  • 2 likes
  • 5 in conversation