My task is to match names from Data Set A to our SIS. I have no unique identifier in Data Set A. Respondents were simply asked to enter their student's full name. The student name in SIS is Last, First, Middle. Last and First are required fields. Middle is optional and may include two (or more) names separated by a space.
Data in SIS (examples):
Last First Middle
Doe John
Doe John Edward
Doe John Edward Sam
******
The data in Set A is freewheeling. Use your imagination.
Data in Set A :
Student
Jane Smith
Jane Lucy Smith
Jane
Jane Lucy Pat Smith
Jane Lucy Pat Smith IV
Jane Lucy Smith (Janey)
I counted the number of blanks in each obs, thinking I can create data sets based on how many blanks are in the name. For example, a data set named Blank1 would have all entries where student was of the form "John Doe", with no middle name given. I could match on first and last or first, middle and last depending on if I'm using Blank1, etc.
My overarching question: is there a better strategy to use to try and match as many as possible? I was not consulted on the front end to give an opinion on how to ask the questions in the survey.
/* DEFINE NAMES - F, M, L */
data setA2;
set A;
/* COUNT NUMBER OF BLANKS IN FULL NAME */
num_blanks=countc(trim(student),' ');
/* CREATE NEEDED FNAME LNAME VARS */
fname=scan(student,1,' ');
lname=scan(student,-1,' ');
/* EXTRACT MIDDLE NAME FROM STUDENTS WITH 2 BLANKS */
IF num_blanks=2 then do;
mname=scan(student,2,' ');
end;
/* STUDENTS WITH 2 MID NAMES IN POWER SCHOOL HAVE BOTH IN MID NAME FIELD W/SPACE BETWEEN */
IF num_blanks=3 then do;
mname1=scan(student,2,' ');
mname2=scan(student,3,' ');
mname=strip(mname1)||' '||strip(mname2);
end;
/* STUDENTS WITH 0 BLANKS -> SET ASIDE FOR NOW. STUD WITH 1,4,5 BLANKS > MATCH ON FNAME LNAME ONLY AND TRY TO MATCH */
if num_blanks=0 then setaside=1;
/* 6 blanks */
if num_blanks=6 then lname=scan(student,3,' ');
run;
proc freq;
table num_blanks;
run;
num_blanks Frequency Percent Cumulative
Frequency Cumulative
Percent
0 6 0.16 6 0.16
1 1031 27.35 1037 27.51
2 2466 65.41 3503 92.92
3 251 6.66 3754 99.58
4 10 0.27 3764 99.84
5 5 0.13 3769 99.97
6 1 0.03 3770 100.00
/* MATCHING WILL DEPEND ON WHETHER THEY HAVE MIDDLE NAME, AND IF IT IS ONE NAME OR TWO SEPARATED BY A SPACE */
data blank0 blank145 blank23;
set one;
if num_blanks=0 then output blank0;
if num_blanks in (1,4,5,6) then output blank1456;
if num_blanks in (2,3) then output blank23;
run;
... View more