SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
GreggB
Pyrite | Level 9

My task is to match  names from Data Set A to our SIS. I have no unique identifier in Data Set A. Respondents were simply asked to enter their student's full name. The student name in SIS  is Last, First, Middle. Last and First are required fields. Middle is optional and may include two (or more) names separated by a space.

Data in SIS (examples):

Last     First     Middle

Doe    John

Doe    John    Edward

Doe    John    Edward Sam

******

The data in Set A is freewheeling. Use your imagination.

Data in Set A :

Student

Jane Smith

Jane Lucy Smith

Jane

Jane Lucy Pat Smith

Jane Lucy Pat Smith IV

Jane Lucy Smith (Janey)

 

I counted the number of blanks in each obs, thinking I can create data sets based on how many blanks are in the name. For example, a data set named Blank1 would have all entries where student was of the form "John Doe", with no middle name given.   I could match on first and last or first, middle and last depending on if I'm using Blank1, etc.

My overarching question: is there a better strategy to use to try and match as many as possible? I was not consulted on the front end to give an opinion on how to ask the questions in the survey. 

/* DEFINE NAMES - F, M, L  */
data setA2;
	set A;	
	/* COUNT NUMBER OF BLANKS IN FULL NAME */
	num_blanks=countc(trim(student),' ');
	/* CREATE NEEDED FNAME LNAME VARS */
	fname=scan(student,1,' ');
	lname=scan(student,-1,' ');
	/* EXTRACT MIDDLE NAME FROM STUDENTS WITH 2 BLANKS */
	IF num_blanks=2  then do;
		mname=scan(student,2,' ');
	end;
	/* STUDENTS WITH 2 MID NAMES IN POWER SCHOOL HAVE BOTH IN MID NAME FIELD W/SPACE BETWEEN */
	IF num_blanks=3  then do;
	mname1=scan(student,2,' ');
	mname2=scan(student,3,' ');
	mname=strip(mname1)||' '||strip(mname2);
	end;
	/* STUDENTS WITH 0 BLANKS -> SET ASIDE FOR NOW. STUD WITH 1,4,5 BLANKS > MATCH ON FNAME LNAME ONLY AND TRY TO MATCH */
	if num_blanks=0 then setaside=1;
	/* 6 blanks */
	if num_blanks=6 then lname=scan(student,3,' ');
run;
proc freq;
table num_blanks;
run;
num_blanks Frequency Percent Cumulative
Frequency Cumulative
Percent 
0 6 0.16 6 0.16 
1 1031 27.35 1037 27.51 
2 2466 65.41 3503 92.92 
3 251 6.66 3754 99.58 
4 10 0.27 3764 99.84 
5 5 0.13 3769 99.97 
6 1 0.03 3770 100.00 
/* MATCHING WILL DEPEND ON WHETHER THEY HAVE MIDDLE NAME, AND IF IT IS ONE NAME OR TWO SEPARATED BY A SPACE */
data blank0 blank145 blank23;
	set one;
	if num_blanks=0 then output blank0;
	if num_blanks in (1,4,5,6) then output blank1456;
	if num_blanks in (2,3) then output blank23;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
GreggB
Pyrite | Level 9

I read some about COMPGED function but never used it. If you have some favorite documentation on it please share. Thanks. 

View solution in original post

2 REPLIES 2
s_lassen
Meteorite | Level 14

Rather than count the number of blanks, I would use the COUNTW function (count of words). In most cases the result may be the same, but with what you call "freewheeling" data, there is bound to be some records with two or more blanks between words. Depending on the source, you may get some records where e.g. a <tab> character is used instead of a blank to separate words. Again, COUNTW can handle that sensibly.

 

Other than that, I think you have your work cut out for you; cleansing and matching records like that can take a lot of time. Obviously, I would start getting rid of the nicknames in parantheses (Janey, in your example), and perhaps also look for commas, which may indicate that the last name comes first, as in "Smith, Jane Lucy".

 

I have done similar stuff once in a while, when the going gets tough and there is a lot of stuff that kind of matches (spelling errors!) the COMPGED function can come in handy; you may have to adapt the score calculation to your needs using CALL COMPCOST. But of course it depends on how many students you are dealing with, probably not as many as a million (which I did once, for an insurance company, trying to match names and addresses to official registers). Sometimes it is easier to get the most obvious fits, and then do the last 20 to 50 manually (your SETASIDE category).

GreggB
Pyrite | Level 9

I read some about COMPGED function but never used it. If you have some favorite documentation on it please share. Thanks. 

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1373 views
  • 1 like
  • 2 in conversation