match the exact same word

Jarvin99 · Posted 12-12-2023 10:38 PM

Hi,

I would say I am very new to SAS...so please kindly help me. I am struggling with a very simple task. I want to be able to identify people with a bachelor's degree. Below is the sample of the data:

people	obtain_year	qualification
Tom	1990	BA
Tom	2002	master
Tom	2010	MBA
Alice	1994Jan	bachelor's degree
Alice	2011	EMBA
Lou	2005	BS
Lou	2006Jun01	MSc
Lou	2012	master of business administration

My desired outcome should create columns of indicators flagging who has bachelor, master, and MBA degrees, regardless of the uppercase and lowercase.

people	obtain_year	qualification	bachelor	master	MBA
Tom	1990	BA	1
Tom	2002	master		1
Tom	2010	MBA			1
Alice	1994Jan	bachelor's degree	1
Alice	2011	EMBA
Lou	2005	BS	1
Lou	2006Jun01	MSc		1
Lou	2012	master of business administration			1

I tried using 'findw' and 'index', but these functions will pick 'MBA' as well when I only want to include 'BA'. Bachelor degrees should be equal to 1 when the qualification contains the exact words: 'bachelor', 'BA', 'BS', or 'bachelor's degree'.

Thank you so much for your assistance in advance, as I really need to get this done soon in order for the next step of my research.

mkeintz · Posted 12-12-2023 11:14 PM

Please provide sample data in the form of a working DATA step, and the sas code you are using. You'll get better advice, and (likely) tested code.

By the way, using FINDW (find word) to detect the unigram 'BA' should not give a false positive searching a string containing 'MBA'.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 12-13-2023 12:07 AM

Code not tested because there was no ready-made sample data but should work.

data want;
  set have;
  length Batchelor Master MBA $1;
  if findw(qualification,'ba','it')>0             then Batchelor='1';
  else if findw(qualification,'bachelor','it')>0  then Batchelor='1';
  else if findw(qualification,'master','it')>0    then Master='1';
  else if findw(qualification,'mba','it')>0       then MBA='1';
run;

Jarvin99 · Posted 12-13-2023 01:29 AM

Hi,

Thank you so much for your code. I managed to solve my problem using the following codes:

data education; 
	set org.na_dir_profile_education;
	if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
	if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BSC')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BE')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BSE')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BENG')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BA')>=1 then bachelor = 1;

	if findw(upcase(qualification), 'B Acc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B Arch')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Acc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Math')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Proc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BAS')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BASc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BAppSc')>=1 then bachelor = 1;

	if findw(upcase(qualification), 'BArch')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BBA')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BBM')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BBS')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BCA')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BCL')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BCom')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BComm')>=1 then bachelor = 1;

	if findw(upcase(qualification), 'BCompt')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BEc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BEcon')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BEd')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BFA')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BInf')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1;
run;

May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?

Thank you.

Patrick · Posted 12-13-2023 04:49 AM

@Jarvin99 wrote:

....
May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?

There might be options to write something dynamic but if and what really is suitable for your data depends on what you have and your search terms.

You were talking about searching for WORD (not strings that can be part of a word) but looking into your code there might be some challenges what defines a word delimiter.

For example based on below it appears both a blank and a fullstop can't always serve as word delimiter.

To create something dynamic the code logic would need to "know" when it can't use the normal word delimiters and what can serve as delimiter.

	if findw(upcase(qualification), 'B Acc')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B Arch')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'B.Acc')>=1 then bachelor = 1;

If you have a series of IF statements that are exclusive then use ELSE IF. There is really no need to execute all the IF's after an expression returns true.

None of above 3 expressions will ever return TRUE because if not explicitly specified via a parameter findw() will use blank and fullstop as word delimiters and though can never find a term that contains such a word delimiter. You also don't need to upcase the source variable if you use the findw() "i" modifier that instructs the function to search case insensitive.

characters in findw() let's you define explicitly what to use as word delimiters if you can't use the default.

I suggest that you read in the docu the details about the findw() function to fully understand what functionality is available and then investigate your search terms to identify the challenges.

Tom · Posted 12-13-2023 11:25 AM

Note that for FINDW() (and other WORD finding functions) it is perfectly fine to search for "words" that contain delimiters.

So

findw(string,'Word1 word2',' ','i')

Will find WORD1 followed by a single space and WORD2 that is either at the beginning or end of the string or between two spaces.

ballardw · Posted 12-13-2023 11:12 AM

Perhaps:

data education; 
 set org.na_dir_profile_education;
 if upcase(qualification) in ( 'BACHELOR'
 "BACHELOR'S DEGREE" 'BS' 'BSC' 'BE' 'BSE' 'BENG' 'BA'
 'B ACC' 'B ARCH' 'B.ACC' 'B.MATH' 'B.PROC' 'BAS' 'BASC' 'BAPPSC' 
 'BARCH' 'BBA' 'BBM' 'BBS' 'BCA' 'BCL' 'BCOM' 'BCOMM' 
 'BCOMPT' 'BEC' 'BECON' 'BED' 'BFA' 'BINF' 'BINFTECH' 'BJ' 
 ) then bachelor=1;
run;

will reduce some of your coding since you apparently have WAY MORE values than your question stated.

Not to mention that several of your FINDW were looking for not upcased value in the search string.

Findw does have the i optional modifier to ignore case.

mkeintz · Posted 12-13-2023 04:23 PM

@Jarvin99 wrote:

Hi,

Thank you so much for your code. I managed to solve my problem using the following codes:
data education; 
	set org.na_dir_profile_education;
	if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
	if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
    ... many lines deleted ...
	if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1;
	if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1;
run;
May I know if there is simplified way like writing a macro to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?

In addition to the comments others have made about using ELSE IF construct's to avoid doing superfluous IF test's, you should consider using a _TEMPORARY array of the search terms as a code-saving device, as in:

data education (drop=i _:); 
	set org.na_dir_profile_education;

     array _text_ba {10} $20 _temporary_
       ('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
        'BAS','BASc','BAppSc')

     bachelor=0;
     do i=1 to dim(_text_ba) until (bachelor=1);
       bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
     end;
run;

The above ignores upper/lower case issues, which are easily addressed. More important is the issue of searching for two-word phrases ("bigrams" in this note. One-word phrases are unigrams). FINDW is not meant to find them. Here's a workaround, which divides your search terms into unigrams and bigrams:

data education (drop=i _:); 
	set org.na_dir_profile_education;
	set have;

     array _unigrams_ba {26} $20 _temporary_
       ('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
        'BAS','BASc','BAppSc','BArch','BBA','BBM','BBS','BCA',
        'BCL','BCom','BComm','BCompt','BEc','BEcon','BEd','BFA',
        'BInf','BInfTech','BJ');

     array _bigrams_ba {6} $20 _temporary_
       ("BACHELOR'S DEGREE",'B Acc','B Arch','B.Acc','B.Math','B.Proc');

     /*Make everything upper-case, for finding purposes */
     _upcase_qual=upcase(qualification);
     if _n_=1 then do;
       do i=1 to dim(_unigrams_ba);  unigrams_ba{i}=upcase(_unigrams_ba{i}); end;
       do i=1 to dim(_bigrams_ba);   bigrams_ba{i}=upcase(_bigrams_ba{i}); end;
     end;

     bachelor=0;
     do i=1 to dim(_unigrams_ba) until (bachelor=1);
       bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
     end;

     if bachelor=0 then do i=1 to dim(_bigrams_ba) until (bachelor=1);
       _w1=findw(_upcase_qual,trim(scan(_bigrams_ba{i},1)),' .','e');
       _w2=findw(_upcase_qual,trim(scan(_bigrams_ba{i},2)),' .','e');
       bachelor=(_w2=_w1+1) and (_w1>0);          
     end;
run;

The "trick" here in dealing with bigrams is to use a feature of FINDW (the 'e' as the 4th parameter of FINDW) that returns the word-sequence number rather than the character position of a search-word inside a string-of-words. The benefit is that if you are searching for "Bachelor's degree", you want to know if the word number of "degree" is one greater than the word number of "Bachelor's". Of course, that is not bullet proof, since it doesn't protect against any of the words appearing more than once, masking detection of the proper sequence. Code can be written to avoid this, but this code is a little more self-evident. The third argument of FINDW ( ' .') tells the function that only those two characters are word delimiters.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

match the exact same word

Re: match the exact same word

Re: match the exact same word

Re: match the exact same word

Re: match the exact same word

Re: match the exact same word

Re: match the exact same word

Re: match the exact same word

Classroom Training Available!