05-05-2015 05:44 AM
The date of birth in our current database is missing for 25% of records. I would like to predict the missing values using first name etc. I heard it gives a good prediction?
Could anyone help with this? How is it done in SAS? What modelling technique do we use here?
Your help would be much appreciated.
05-05-2015 06:40 AM
Never heard of such a thing. I see there is an R package for guessing gender from a name, but I couldn't find anything on age, and I don't see how it would work anyway. If you have to have age, why not just assign a random age within certain group ranges, or infer it from other data? For example, they had an "xyz procedure" on a given date, so they would be > 18 at that point, or they got a credit card on a given date, which indicates they were at least 18 at that point.
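To make the "infer it from other data" idea concrete, here is a minimal sketch in Python (not SAS). It assumes each event implies a minimum age on the event date. The function names, the events, and the age-18 threshold are hypothetical illustrations, not fields from the original poster's database. Note that it only gives a latest possible birth date, not an estimate of the actual one.

```python
from datetime import date

def latest_birth_date(event_date, min_age=18):
    """Latest possible birth date for someone at least min_age on event_date."""
    try:
        return event_date.replace(year=event_date.year - min_age)
    except ValueError:
        # event_date is Feb 29 and the target year is not a leap year
        return event_date.replace(year=event_date.year - min_age, day=28)

def birth_date_upper_bound(events):
    """Tightest bound from several (event_date, min_age) pairs."""
    return min(latest_birth_date(d, a) for d, a in events)

# Hypothetical events for one customer: (date of event, minimum age it implies)
events = [
    (date(2010, 3, 15), 18),  # e.g. a credit card opened on this date
    (date(2012, 7, 1), 18),   # e.g. a procedure that requires adult consent
]
bound = birth_date_upper_bound(events)
print("Born on or before:", bound)
```

The earliest qualifying event gives the tightest bound, which is why the sketch takes the minimum.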
05-05-2015 09:37 AM
Are you using Enterprise Miner? If so, you can use the Impute node and choose the Tree option.
Hope this helps!
05-05-2015 09:41 AM
One of the frustrations of a question like this is that people will suggest methods to do it without stopping to mention that the idea of predicting a birth date from a first name seems to make no sense at all.
For any prediction method to work, there must be some sort of "correlation" between the input and the output. Maybe there is some correlation that I don't know about, but at this time I would advise the original poster not to do this at all. The original poster did say "I would like to predict the missing using first name etc. I heard it gives a good prediction?" but unless you can give us a reference, I think you're wasting your time.
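One rough way to check whether any such "correlation" exists in your own data is to measure, on the records where the birth date is known, what share of the variance in birth year is explained by first name (the correlation ratio, often written eta squared). Below is a minimal sketch in Python rather than SAS. The function name and the toy data are made up for illustration.

```python
from collections import defaultdict
from statistics import fmean

def correlation_ratio(names, birth_years):
    """Share of birth-year variance explained by name (0 = none, 1 = all)."""
    groups = defaultdict(list)
    for name, year in zip(names, birth_years):
        groups[name].append(year)
    overall = fmean(birth_years)
    total_ss = sum((y - overall) ** 2 for y in birth_years)
    if total_ss == 0:
        return 0.0  # every birth year is identical, nothing to explain
    between_ss = sum(
        len(ys) * (fmean(ys) - overall) ** 2 for ys in groups.values()
    )
    return between_ss / total_ss

# Toy data, invented for illustration only
names = ["Ann", "Ann", "Bob", "Bob"]
years = [1950, 1990, 1950, 1990]
print(correlation_ratio(names, years))
```

A caveat: a name that appears only once "explains" its own birth year perfectly, so with many rare names this figure is inflated. It is safer to compute the group means on one part of the data and judge them on a held-out part.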
05-05-2015 11:29 PM
There's a Wolfram Alpha topic that's relevant: https://www.wolframalpha.com/input/?i=name+William&lk=3 . The "Estimated Current Age Distribution" plot is enlightening - it bears out PaigeMiller's advice - the slight correlation that's visible is far too loose to justify using this idea.
05-06-2015 12:21 PM
An added complication is that, at least in many areas of the United States, parents have been attempting to give children "unique" names. Below is a selected list of girls' names from a recent 10-year period. Careful reading will show that many of these names are somewhat phonetically equivalent to relatively common names, Aeryka <=> Erica for example.
August Star, Divinity, Jurnee, Nutaliay Harmoney, Surreal
05-06-2015 10:51 AM
That is a great suggestion and a well-founded, scalable, and contemporary method for addressing missing values in a predictive model. The idea is that a decision tree will use patterns detected from *all* the variables, including ones that may not be obvious to us (e.g. 2-way correlations), to predict the missing value for each observation.
Several other best practices for handling missing values include:
1. Simply leaving the missing values in the data and using a decision tree or an ensemble of decision trees (i.e. random forest and/or gradient boosting) as your final predictive model.
Decision trees handle missing values at least 2 different ways:
--- In training they can group missing values in bins by themselves or along with other values of a variable, and use missing values to build the predictive model.
--- Surrogate rules: decision trees can use a variable like "State" to make a decision about a variable like "ZipCode" if it encounters a missing value for "ZipCode".
2. Impute the missing values however you like but retain a binary missing value indicator variable, so that missingness can be used to help make your final predictions.
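As a rough illustration of point 2, here is a minimal sketch in plain Python (not SAS or Enterprise Miner). It assumes missing values are stored as None and uses the median as the fill value purely as an example; any imputation method could be swapped in. The function and variable names are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None with the median of observed values and flag what was filled."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing is observed
    imputed = [fill if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag

# Made-up ages, with None marking a missing value
ages = [34, None, 51, 27, None]
imputed_ages, age_missing = impute_with_indicator(ages)
print(imputed_ages)
print(age_missing)
```

Both columns would then go into the final model, so that it can learn whether records with a missing value behave differently from the rest.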
Hope that helps.