I have a data set with job titles and job codes that are 1 to 8 digits long. I have another data set with a study population that have job codes from 10 to 12 digits long. The first 1 to 8 digits will match the job codes from the first data set. My plan is to write code to match the first 8 digits. If it matches, done. If not, match the first 7 digits. If it matches, done. If not, match the first 6 digits. And keep going until it's done. My question is how to write this in SAS. Would the "or" operator achieve this way of matching on the highest number digits, stopping if it finds a match, and looking for the next number of digits if not? Should the code look like this?
proc sql;
create table job_titles as
select A.*, B.job_title
from study_population as A
left join occupation_data as B
on A.industry = B.industry
and A.seniority = A.seniority
and (upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,8))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,7))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,6))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,5))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,4))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,3))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,2))
or upcase(A.job_code) = upcase(substr(B.Occupation_Code,1,1)) );
quit;
You'll notice that I am matching by industry and seniority first because the code will depend on those 2 things as well.
and (upcase(A.job_code) =: upcase(substr(B.Occupation_Code,1,8))
I think you can do just that in SAS (colon addition) but not 100% sure it works in SQL. Definitely test it out.
=: does not work in SQL, but eqt does work in SQL to see if the start of strings match.
To @daufoi why do you need the upcase() function if the strings are digits?
... and a.job_code eqt b.occupation_code
So you want to find the best match. In that case, you will have to calculate the quality of the match in an inner query (code not tested):
proc sql;
create table job_titles as
select *
from(
select A.*,b.Jobtitle,
case
when upcase(substr(A.jobcode,1,8))=upcase(substr(B.Occupation_Code,1,8)) then 8
when upcase(substr(A.jobcode,1,7))=upcase(substr(B.Occupation_Code,1,7)) then 7
when upcase(substr(A.jobcode,1,6))=upcase(substr(B.Occupation_Code,1,6)) then 6
when upcase(substr(A.jobcode,1,5))=upcase(substr(B.Occupation_Code,1,5)) then 5
when upcase(substr(A.jobcode,1,4))=upcase(substr(B.Occupation_Code,1,4)) then 4
when upcase(substr(A.jobcode,1,3))=upcase(substr(B.Occupation_Code,1,3)) then 3
when upcase(substr(A.jobcode,1,2))=upcase(substr(B.Occupation_Code,1,2)) then 2
when upcase(substr(A.jobcode,1,1))=upcase(substr(B.Occupation_Code,1,1)) then 1
else 0
end as match
from study_population as A
left join occupation_data as B
on A.industry = B.industry
and A.seniority = A.seniority
)
where match>0
group by industry,seniority
having match=max(match)
;
quit;
This will also put the MATCH column on the output, if you do not want that you will have to name the columns explicitly in the outer query, or use a DROP= option on the output table.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.