BookmarkSubscribeRSS Feed
Fluorite | Level 6

## matching on the highest number digits, stopping if it finds a match

I have a data set with job titles and job codes that are 1 to 8 digits long. I have another data set with a study population that have job codes from 10 to 12 digits long. The first 1 to 8 digits will match the job codes from the first data set. My plan is to write code to match the first 8 digits. If it matches, done. If not, match the first 7 digits. If it matches, done. If not, match the first 6 digits. And keep going until it's done. My question is how to write this in SAS. Would the "or" operator achieve this way of matching on the highest number digits, stopping if it finds a match, and looking for the next number of digits if not? Should the code look like this?

``````proc sql;
create table job_titles as
select A.*, B.job_title
from study_population as A
left join occupation_data as B
on A.industry = B.industry
and A.seniority = A.seniority
and (upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,8))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,7))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,6))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,5))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,4))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,3))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,2))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,1)) );
quit;``````

You'll notice that I am matching by industry and seniority first because the code will depend on those 2 things as well.

3 REPLIES 3
Super User

## Re: matching on the highest number digits, stopping if it finds a match

``````	and (upcase(A.job_code) =: upcase(substr(B.Occupation_Code,1,8))
``````

I think you can do just that in SAS (colon addition) but not 100% sure it works in SQL. Definitely test it out.

Diamond | Level 26

## Re: matching on the highest number digits, stopping if it finds a match

=: does not work in SQL, but eqt does work in SQL to see if the start of strings match.

To @daufoi why do you need the upcase() function if the strings are digits?

``... and a.job_code eqt b.occupation_code``
--
Paige Miller
Meteorite | Level 14

## Re: matching on the highest number digits, stopping if it finds a match

So you want to find the best match. In that case, you will have to calculate the quality of the match in an inner query (code not tested):

``````proc sql;
create table job_titles as
select *
from(
select A.*,b.Jobtitle,
case
when upcase(substr(A.jobcode,1,8))=upcase(substr(B.Occupation_Code,1,8)) then 8
when upcase(substr(A.jobcode,1,7))=upcase(substr(B.Occupation_Code,1,7)) then 7
when upcase(substr(A.jobcode,1,6))=upcase(substr(B.Occupation_Code,1,6)) then 6
when upcase(substr(A.jobcode,1,5))=upcase(substr(B.Occupation_Code,1,5)) then 5
when upcase(substr(A.jobcode,1,4))=upcase(substr(B.Occupation_Code,1,4)) then 4
when upcase(substr(A.jobcode,1,3))=upcase(substr(B.Occupation_Code,1,3)) then 3
when upcase(substr(A.jobcode,1,2))=upcase(substr(B.Occupation_Code,1,2)) then 2
when upcase(substr(A.jobcode,1,1))=upcase(substr(B.Occupation_Code,1,1)) then 1
else 0
end as match
from study_population as A
left join occupation_data as B
on A.industry = B.industry
and A.seniority = A.seniority
)
where match>0
group by industry,seniority
having match=max(match)
;
quit;``````

This will also put the MATCH column on the output, if you do not want that you will have to name the columns explicitly in the outer query, or use a DROP= option on the output table.

Discussion stats
• 3 replies
• 319 views
• 2 likes
• 4 in conversation