- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
My task is to match names from Data Set A to our SIS. I have no unique identifier in Data Set A. Respondents were simply asked to enter their student's full name. The student name in SIS is Last, First, Middle. Last and First are required fields. Middle is optional and may include two (or more) names separated by a space.
Data in SIS (examples):
Last First Middle
Doe John
Doe John Edward
Doe John Edward Sam
******
The data in Set A is freewheeling. Use your imagination.
Data in Set A :
Student
Jane Smith
Jane Lucy Smith
Jane
Jane Lucy Pat Smith
Jane Lucy Pat Smith IV
Jane Lucy Smith (Janey)
I counted the number of blanks in each obs, thinking I can create data sets based on how many blanks are in the name. For example, a data set named Blank1 would have all entries where student was of the form "John Doe", with no middle name given. I could match on first and last or first, middle and last depending on if I'm using Blank1, etc.
My overarching question: is there a better strategy to use to try and match as many as possible? I was not consulted on the front end to give an opinion on how to ask the questions in the survey.
/* DEFINE NAMES - F, M, L */
data setA2;
set A;
/* COUNT NUMBER OF BLANKS IN FULL NAME */
num_blanks=countc(trim(student),' ');
/* CREATE NEEDED FNAME LNAME VARS */
fname=scan(student,1,' ');
lname=scan(student,-1,' ');
/* EXTRACT MIDDLE NAME FROM STUDENTS WITH 2 BLANKS */
IF num_blanks=2 then do;
mname=scan(student,2,' ');
end;
/* STUDENTS WITH 2 MID NAMES IN POWER SCHOOL HAVE BOTH IN MID NAME FIELD W/SPACE BETWEEN */
IF num_blanks=3 then do;
mname1=scan(student,2,' ');
mname2=scan(student,3,' ');
mname=strip(mname1)||' '||strip(mname2);
end;
/* STUDENTS WITH 0 BLANKS -> SET ASIDE FOR NOW. STUD WITH 1,4,5 BLANKS > MATCH ON FNAME LNAME ONLY AND TRY TO MATCH */
if num_blanks=0 then setaside=1;
/* 6 blanks */
if num_blanks=6 then lname=scan(student,3,' ');
run;
proc freq;
table num_blanks;
run;
num_blanks Frequency Percent Cumulative
Frequency Cumulative
Percent
0 6 0.16 6 0.16
1 1031 27.35 1037 27.51
2 2466 65.41 3503 92.92
3 251 6.66 3754 99.58
4 10 0.27 3764 99.84
5 5 0.13 3769 99.97
6 1 0.03 3770 100.00
/* MATCHING WILL DEPEND ON WHETHER THEY HAVE MIDDLE NAME, AND IF IT IS ONE NAME OR TWO SEPARATED BY A SPACE */
data blank0 blank145 blank23;
set one;
if num_blanks=0 then output blank0;
if num_blanks in (1,4,5,6) then output blank1456;
if num_blanks in (2,3) then output blank23;
run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I read some about COMPGED function but never used it. If you have some favorite documentation on it please share. Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Rather than count the number of blanks, I would use the COUNTW function (count of words). In most cases the result may be the same, but with what you call "freewheeling" data, there is bound to be some records with two or more blanks between words. Depending on the source, you may get some records where e.g. a <tab> character is used instead of a blank to separate words. Again, COUNTW can handle that sensibly.
Other than that, I think you have your work cut out for you; cleansing and matching records like that can take a lot of time. Obviously, I would start getting rid of the nicknames in parantheses (Janey, in your example), and perhaps also look for commas, which may indicate that the last name comes first, as in "Smith, Jane Lucy".
I have done similar stuff once in a while, when the going gets tough and there is a lot of stuff that kind of matches (spelling errors!) the COMPGED function can come in handy; you may have to adapt the score calculation to your needs using CALL COMPCOST. But of course it depends on how many students you are dealing with, probably not as many as a million (which I did once, for an insurance company, trying to match names and addresses to official registers). Sometimes it is easier to get the most obvious fits, and then do the last 20 to 50 manually (your SETASIDE category).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I read some about COMPGED function but never used it. If you have some favorite documentation on it please share. Thanks.