BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
colabear
Obsidian | Level 7

 

I have a data set containing FIRST NAME and LAST NAME variables. In order to execute my program correctly, I need a clean dataset. But, the one I'm working with has enough issues to keep me busy for days. What I need for the program to work is to have only FN and LN without any special characters or spaces or suffixes. The table below shows what kinds of issues I am encountering. The asterisk indicates missing value. I would like to move probable names to the correct fields. 

 

FirstName

LastName

George

Washington Carver

George Washington

Carver

*

George W Carver

*

George W

George

Washington-Carver

George

WashingtonCarver

George

Carver, III

 

data testing1b;
set testing1a;
if ANYPUNCT (clientFirstName)=1 then do
	Fname1=scan (clientFirstName,1,", ");
	Fname2=scan (clientFirstName,2,", ");
	Fname3=scan (clientFirstName,3,", ");
	Fname4=scan (clientFirstName,4,", ");
	If Fname1='JR' or 'SR' or 'II' or 'III' or length(Fname1)=1 then Fname1=' ';
	If Fname2='JR' or 'SR' or 'II' or 'III' or length(Fname2)=1 then Fname2=' ';
	If Fname3='JR' or 'SR' or 'II' or 'III' or length(Fname3)=1 then Fname3=' ';
	If Fname4='JR' or 'SR' or 'II' or 'III' or length(Fname4)=1 then Fname4=' ';
end;

 

1 ACCEPTED SOLUTION

Accepted Solutions
colabear
Obsidian | Level 7

I solved the problem with the following code: 

data clean_fn;
set testing2b;
last_name = scan(FN,-1,' '); 
mi_name = scan(FN,-2,' ');
first_name = scan(FN,-3,' ');
keep FN LN last_name first_name mi_name;
run;

In this dataset, the FN variable contained first, middle and last name as a value. This worked out very well considering the size of the dataset. 

View solution in original post

6 REPLIES 6
colabear
Obsidian | Level 7

I am seeking advice on how to clean up the dataset. There are issues with FN and LN values which I need to resolve. The dataset contains 480,000 records. I need assistance finding an effectient code to resolve the issues. I will then be using this dataset to conduct a match through another SAS program. 

ballardw
Super User

Will your data be matched to a known data set of "clean" values or do you need to clean this without any outside comparison?

How many records are you dealing with? Some techniques that may work for a couple hundred names in a reasonable amount of time may not work if you're dealing with millions of records.

 

There are a number of search and replace function such as TRANWRD, TRANSLATE and COMPRESS that you could use to reduce some of this

For instanc instead of scanning and looking for JR in mulple places use Tranwrd;

data example;
   name='Smith, John JR';
   cleaname = tranwrd(name,'JR','');
run;

You can use compress to replace some obnoxious characters

 

data example;
   name= '-Smith, * John !';
   cleanname =  compress(name,'*|-/!');
run;
colabear
Obsidian | Level 7

Hi. Thank you for your response. The dataset I'm working with contains 480,000 records. About 10,000 appear to have issues with FN and LN variables having values in the incorrect column.

 

Yes, I am cleaning this dataset so it can be run through a matching program I created. I'll try your suggestions and see what happens. 

LinusH
Tourmaline | Level 20
To retain a good solution that is maintainable (is that word?) over time, you might want to check out SAS Data Management Studio.
Data never sleeps
colabear
Obsidian | Level 7

I solved the problem with the following code: 

data clean_fn;
set testing2b;
last_name = scan(FN,-1,' '); 
mi_name = scan(FN,-2,' ');
first_name = scan(FN,-3,' ');
keep FN LN last_name first_name mi_name;
run;

In this dataset, the FN variable contained first, middle and last name as a value. This worked out very well considering the size of the dataset. 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 898 views
  • 0 likes
  • 4 in conversation