Hello:
I got involved with a task to match two datasets that baffles me to no end. I have two sample datasets.
The first dataset has the courses that students enrolled during the term
DS1
ID | COURSE | TERM |
1 | Introduction to Chemistry | 1 |
1 | College Mathematics | 2 |
1 | Biological Systems | 2 |
2 | College Algebra | 1 |
2 | Introduction to Chemistry | 1 |
3 | Communication and Rhetoric | 1 |
3 | College Algebra | 2 |
4 | Biological Systems | 1 |
4 | Communication and Rhetoric | 2 |
The second dataset has the visits in tutoring centers. Here I do not have course title I have only reason for tutoring.
DS2
ID | Tutoring Subject | TERM |
1 | Chemistry | 1 |
1 | Algebra | 2 |
1 | Writing | 2 |
2 | Mathematics | 1 |
2 | Chemical Engineering | 2 |
3 | Writing | 1 |
3 | Algebra | 2 |
4 | Biology | 1 |
Using these two datasets I am trying to match courses with tutoring visits and end up with the following table:
ID | COURSE | TERM | Tutored_in Term | Tutored_in Course |
1 | Introduction to Chemistry | 1 | Yes | Yes |
1 | College Mathematics | 2 | Yes | Yes |
1 | Biology | 2 | Yes | No |
2 | College Algebra | 1 | Yes | Yes |
2 | Introduction to Chemistry | 1 | No | No |
3 | Communication | 1 | Yes | Yes |
3 | College Algebra | 2 | Yes | Yes |
4 | Biological Systems | 1 | Yes | Yes |
4 | Communication and Rhetoric | 2 | No | No |
The criteria for matching to have this table are:
a) if tutoring reason has similarity with the title of the course then tutored in course = yes
a) if a student visited a tutoring center regardless the matching of the title of the course then tutored in term = yes
How you would go to get to this table?
Chris
Hi Reeza,
Thank you for your message. You are right. I corrected the output to be college algebra.
Question: How many courses do you have?
Answer: I have a list of courses.... more than the ones in the list... so the list could be like this
1. Introduction to Chemistry
2. Chemical Engineering
3. College Algebra
4. Precalculus
5. Trigonometry
6. Introduction to Biology
7. Genetics
8. Biological Systems
9. Communication and Rhetoric
10. Communication and Rhetoric II
11. etc
12. etc...
....
100. Differential Equations
101. Humanities
Question: How closely will the titles match?
This is tricky...but at a more simple thought:
Answer: If there is a same word in the cells between the two cells(i.e. Course and Tutoring Reason) then the two cells match
If there is a different but similar subject between the two cells and there is no other cell with similar word at the same term
then the two cells (i.e. Course and Tutoring Reason) match
I also corrected the output table to have college algebra. I hope that now is less confusing.
Thank you so very much for those questions
Thank you for your insightful questions. I apologize of the confusion
a) If there is a same word in the cells (i.e. Course and Tutoring Reason) between then the two cells match
b) If there is a different but similar subject between the two cells and there is no other cell with similar word at the same term then the two cells (i.e. Course and Tutoring Reason) match.
The available courses are more than the ones in the first dataset. It includes those in the table and others.
I corrected also the output (i.e. third table) to have everywhere college algebra.
I hope now is a bit less confusing.
You could start by building a table of matches. That process cannot be fully automated, but a good starting point could be:
proc sql;
create table matches as
select course, subject
from ds1, ds2
where course contains substr(scan(subject,1),1,5)
union
select course, subject
from ds1, ds2
where countw(subject) > 1 and course contains substr(scan(subject,2),1,5);
quit;
you will need to add and remove some matches in that table, but hopefully that query should save you some time.
Thank you PGStats that is a helpful start
I was wondering If I can use in this scenario any of those functions SOUNDEX, COMPGED, COMPLEV or SPEDIS functions to apply fuzzy matching between DS1(course) and DS2(tutoring_reason)… but I am not sure how can this be feasible.
Is anyone familiar with those functions? Can they be useful in this case?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.