@Jarvin99 wrote:
Hi,
Thank you so much for your code. I managed to solve my problem using the following codes:
data education;
set org.na_dir_profile_education;
if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
... many lines deleted ...
if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1;
if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1;
run;
May I know if there is a simplified way, such as writing a macro, to shorten the above? There are at least 30-40 qualifying words for my bachelor indicator.
In addition to the comments others have made about using ELSE IF constructs to avoid superfluous IF tests, you should consider using a _TEMPORARY_ array of the search terms as a code-saving device, as in:
data education (drop=i _:);
set org.na_dir_profile_education;
array _text_ba {10} $20 _temporary_
('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
'BAS','BASc','BAppSc');
bachelor=0;
do i=1 to dim(_text_ba) until (bachelor=1);
bachelor=(findw(qualification,trim(_text_ba{i}))>=1);
end;
run;
The above ignores upper/lower case issues, which are easily addressed. More important is the issue of searching for two-word phrases ("bigrams" in this note; one-word phrases are "unigrams"). FINDW is not meant to find them. Here's a workaround, which divides your search terms into unigrams and bigrams:
data education (drop=i _:);
set org.na_dir_profile_education;
array _unigrams_ba {26} $20 _temporary_
('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
'BAS','BASc','BAppSc','BArch','BBA','BBM','BBS','BCA',
'BCL','BCom','BComm','BCompt','BEc','BEcon','BEd','BFA',
'BInf','BInfTech','BJ');
array _bigrams_ba {6} $20 _temporary_
("BACHELOR'S DEGREE",'B Acc','B Arch','B.Acc','B.Math','B.Proc');
/*Make everything upper-case, for finding purposes */
_upcase_qual=upcase(qualification);
if _n_=1 then do;
do i=1 to dim(_unigrams_ba); _unigrams_ba{i}=upcase(_unigrams_ba{i}); end;
do i=1 to dim(_bigrams_ba); _bigrams_ba{i}=upcase(_bigrams_ba{i}); end;
end;
bachelor=0;
do i=1 to dim(_unigrams_ba) until (bachelor=1);
bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
end;
if bachelor=0 then do i=1 to dim(_bigrams_ba) until (bachelor=1);
_w1=findw(_upcase_qual,trim(scan(_bigrams_ba{i},1)),' .','e');
_w2=findw(_upcase_qual,trim(scan(_bigrams_ba{i},2)),' .','e');
bachelor=(_w2=_w1+1) and (_w1>0);
end;
run;
The "trick" for bigrams is the 'e' modifier (the 4th argument of FINDW), which makes the function return the word-sequence number of the search word rather than its character position in the string. So when searching for "Bachelor's degree", you check whether the word number of "degree" is exactly one greater than the word number of "Bachelor's". That is not bulletproof: if either word appears more than once, FINDW reports only the first occurrence, which can mask a properly ordered pair later in the string. Code can be written to avoid this, but the version above is a little more self-evident. The third argument of FINDW (' .') tells the function that only blank and period are word delimiters.
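To make the word-number behaviour concrete, here is a minimal sketch that runs the same two FINDW calls on two test strings. Both strings are invented for illustration and are not taken from your data. The first should give adjacent word numbers (3 and 4); the second repeats "DEGREE" ahead of the real phrase, so it shows the repeated-word weakness described above.

```sas
data _null_;
   length text $60;
   /* Hypothetical sample values, for illustration only */
   do text = "HOLDS A BACHELOR'S DEGREE IN LAW",
             "DEGREE HELD BACHELOR'S DEGREE";
      /* 'e' modifier: return the word number, not the character position */
      /* ' .' as 3rd argument: only blank and period separate words       */
      w1 = findw(text, "BACHELOR'S", ' .', 'e');
      w2 = findw(text, "DEGREE",     ' .', 'e');
      /* Phrase counts as found only when DEGREE directly follows BACHELOR'S */
      adjacent = (w1 > 0) and (w2 = w1 + 1);
      put text= w1= w2= adjacent=;
   end;
run;
```

In the second string FINDW stops at the first "DEGREE" (word 1), so adjacent comes out 0 even though the phrase is present at words 3 and 4. If that case matters in your data, one option would be to repeat the search from a later start position using FINDW's optional start-position argument.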