BookmarkSubscribeRSS Feed
cgprog
Fluorite | Level 6

Hello:

 

I got involved with a task to match two datasets that baffles me to no end. I have two sample datasets.

The first dataset has the courses that students enrolled during the term

 

 

DS1

 

IDCOURSETERM
1Introduction to Chemistry1
1College Mathematics2
1Biological Systems 2
2College Algebra1
2Introduction to Chemistry1
3Communication and Rhetoric1
3College Algebra2
4Biological Systems1
4Communication and Rhetoric2

 

The second dataset has the visits in tutoring centers. Here I do not have course title I have only reason for tutoring.

 

DS2

IDTutoring SubjectTERM
1Chemistry1
1Algebra2
1Writing2
2Mathematics1
2Chemical Engineering2
3Writing1
3Algebra2
4Biology1

 

Using these two datasets I am trying to match courses with tutoring visits and end up with the following table:

IDCOURSETERMTutored_in TermTutored_in Course
1Introduction to Chemistry1YesYes
1College Mathematics2YesYes
1Biology2YesNo
2College Algebra1YesYes
2Introduction to Chemistry1NoNo
3Communication1YesYes
3College Algebra2YesYes
4Biological Systems1YesYes
4Communication and Rhetoric2NoNo

 

The criteria for matching to have this table are:

a) if tutoring reason has similarity with the title of the course then tutored in course = yes

a) if a student visited a tutoring center regardless the matching of the title of the course then tutored in term = yes

 

How you would go to get to this table?

 

Chris

 

5 REPLIES 5
Reeza
Super User
You need to define the rules for matching. How many courses do you have? How closely will the titles match? Is this representative of your actual data? If it's messier this becomes a lot more complex.
Your output has both Algebra and College Algebra, in a case like this, how do you know which to match to?
cgprog
Fluorite | Level 6

Hi Reeza,

 

Thank you for your message. You are right. I corrected the output to be college algebra.

 

Question: How many courses do you have?

Answer: I have a list of courses.... more than the ones in the list... so the list could be like this

 

1. Introduction to Chemistry

2. Chemical Engineering

3. College Algebra

4. Precalculus

5. Trigonometry

6. Introduction to Biology

7. Genetics

8. Biological Systems

9. Communication and Rhetoric

10. Communication and Rhetoric II

11. etc

12. etc...

....

100. Differential Equations

101. Humanities

 

Question: How closely will the titles match?

 

This is tricky...but at a more simple thought:

Answer: If there is a same word in the cells  between the two cells(i.e. Course and Tutoring Reason) then the two cells match 

              If there is a different but similar subject between the two cells and there is no other cell with similar word at the same term

              then the two cells (i.e. Course and Tutoring Reason) match

 

I also corrected the output table to have college algebra. I hope that now is less confusing.

 

Thank you so very much for those questions

 

 

cgprog
Fluorite | Level 6

Thank you for your insightful questions. I apologize of the confusion

 

a) If there is a same word in the cells (i.e. Course and Tutoring Reason) between then the two cells match
b) If there is a different but similar subject between the two cells and there is no other cell with similar word at the same term then the two cells (i.e. Course and Tutoring Reason) match.

 

The available courses are more than the ones in the first dataset. It includes those in the table and others.

 

I corrected also the output (i.e. third table) to have everywhere college algebra.

 

I hope now is a bit less confusing.

 

 

PGStats
Opal | Level 21

You could start by building a table of matches. That process cannot be fully automated, but a good starting point could be:

 

proc sql;
create table matches as
select course, subject
from ds1, ds2
where course contains substr(scan(subject,1),1,5) 
union
select course, subject
from ds1, ds2
where countw(subject) > 1 and course contains substr(scan(subject,2),1,5);
quit;

you will need to add and remove some matches in that table, but hopefully that query should save you some time.

PG
cgprog
Fluorite | Level 6

Thank you PGStats that is a helpful start

 

I was wondering If I can use in this scenario any of those functions SOUNDEX, COMPGED, COMPLEV or SPEDIS functions to apply fuzzy matching between DS1(course) and DS2(tutoring_reason)… but I am not sure how can this be feasible.

 

Is anyone familiar with those functions? Can they be useful in this case?

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1778 views
  • 2 likes
  • 3 in conversation