Hi All
The date of birth in our current database is missing for 25% of records. I would like to predict the missing values using first name etc. I heard it gives a good prediction?
Could anyone help with this? How is it done in SAS? What modelling technique do we use here?
Your help would be much appreciated
Many Thanks
Never heard of such a thing. I see there is an R package for guessing gender from a name, but I couldn't find anything on age, and I don't see how it would work anyway. Why not just assign a random age within certain group ranges if you have to have an age, or infer it from other data? For example, they had an "xyz procedure" on a given date, so they would be > 18 at that point, or they got a credit card on a given date, which indicates they were at least 18 at that point.
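To illustrate the "infer it from other data" idea, here is a minimal Python sketch (not SAS) that turns a dated event into the latest birth date consistent with it. The function name, the example date, and the age threshold of 18 are assumptions for illustration only; the real threshold depends on the event and the jurisdiction.

```python
from datetime import date

def latest_birth_date(event_date: date, min_age: int = 18) -> date:
    """Latest birth date consistent with being at least min_age on event_date."""
    try:
        return event_date.replace(year=event_date.year - min_age)
    except ValueError:
        # event_date is 29 Feb and the target year is not a leap year
        return event_date.replace(year=event_date.year - min_age, day=28)

# Hypothetical record: a credit card opened on 15 June 2010
card_opened = date(2010, 6, 15)
print(latest_birth_date(card_opened))  # 1992-06-15
```

This only gives an upper bound on the birth date, not an estimate. With several events per person, the earliest of the bounds is the tightest one.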
Are you using Enterprise Miner? If so, you can use the Impute node and choose the Tree option.
Hope this helps!
Ray
One of the frustrations of a question like this is that people will suggest methods to do this without stopping to mention that the idea of predicting a birth date based on first name seems to make no sense at all.
For any prediction method to work, there must be some sort of "correlation" between the input and the output. Maybe there is some correlation that I don't know about, but at this time, I would advise the original poster not to do this at all. The original poster did say "I would like to predict the missing values using first name etc. I heard it gives a good prediction?" but unless you can give us a reference, I think you're wasting your time.
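One way to check whether any such "correlation" exists is to measure it on the 75% of records where the birth date is known. The sketch below is in Python rather than SAS and computes the correlation ratio (eta squared), the share of birth-year variance explained by grouping on first name. The function name and the toy data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def correlation_ratio(names, years):
    """Eta squared: share of the variance in years explained by grouping on name."""
    groups = defaultdict(list)
    for name, year in zip(names, years):
        groups[name].append(year)
    grand_mean = mean(years)
    ss_total = sum((y - grand_mean) ** 2 for y in years)
    if ss_total == 0:
        return 0.0  # no variance at all, so nothing to explain
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    return ss_between / ss_total

# Toy data: here the name says nothing about the birth year
names = ["Ann", "Ann", "Bob", "Bob"]
years = [1950, 1990, 1950, 1990]
print(correlation_ratio(names, years))  # 0.0
```

A caveat: names that occur only once explain their own value perfectly, so this figure is inflated when many names are rare. It would be safer to restrict it to names with many records, or to judge predictions on held-out data.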
There's a Wolfram Alpha topic that's relevant: https://www.wolframalpha.com/input/?i=name+William&lk=3 . The "Estimated Current Age Distribution" plot is enlightening - it bears out PaigeMiller's advice - the slight correlation that's visible is far too loose to justify using this idea.
An added complication is that, at least in many areas of the United States, parents have been attempting to give their children "unique" names. Below is a selected list of girls' names from a recent 10-year period. Careful reading will show that many of these names are roughly phonetically equivalent to relatively common names, Aeryka <=> Erica for example.
Acacia,Chili,Indigodawn,Mem' Ree,Secret-Destiny
Aeryka,Cinamin,Indyana,Memphis,Serendipity
Alaska,Clarity,Infiniti,Mesa,Shasta
Alastrionna,Clarixxa,Innocence,Miami,Shasta-Rain
Allyvia,Cloketta,Integrity,Mishalyn,Sha'uri
Alpine,Creedance,Isabow,Modiesty-Star,Shy
Ambrosia,Crimson,Itali,Monet,Sicily
Americus,Cymmetry,Ixzy,Montana,Silver
Amnesty,Cynica,Izrayelle,Mysticque,Sincere
Anakalia,Daiquiri,Jaazminh,Nature,Snow
Angelic,Daytona,Jetta,Nautica,Sonrisa
Aptisam,Dayzee,Jewleah,Navy,Soul
Aquilla,Dazsha,Jorja,Nirvana,Sparrow
Arbor,Denym,Jubilee,Normandie,Starlit
Arlington,Diamonique,Juniper,Northstar,Sublym
Ataree,Diligence,Jupiter,Noxx,Supernova
August Star,Divinity,Jurnee,Nutaliay Harmoney,Surreal
Auktober,Dorcas,Jynnjer,Octayvia,Symphony
Auroaramackay,Draekli,Kahlua,Olive,Tacheranai
Autumn Hunnie,Dream,Kalispell,Otila,Tanaquil
Autym,Dublin,Kanyon,Oyuky,Tehyanatane
Aveda,Dymand,Karizma,Pallas,Teighlor
Beatriz,Eboneyrose,Kaskade,Pandora,Tennessee
Beautifull,Ecstacy,Kazpyr,Payshence,Thyme
Belphoebe,Eeleceya,Kezzi,Peaches,Tottie
Berlyn,Elexious,Khlover,Pennilane Meadow,Tragen
Berthaalicia,England,Kiffin,Pepper,Tricity
Bicardi,Envy,Klowie,Perfect,Trulie
Blayde,Eos,Kozmo,Persephone,Trynadee
Blessin,Epiphany,Krickette,Phaedra,Tsunami
Blyss,Essence,Kronic,Poet,Tuesday-Rain
Boisen,Eternytie,Krymsun,Poppy,Tundra
Braenwynne,Fall,Kwincee,Prairie,Tyranny
Breeze,Fancee,Lala,Pranaleyadri,Ugonna
Bristol,Fashion,Lavender,Pranathi,Uneike
Brittanica,Fayble,Lectra,Promise,Utahnna
Brixx,Fayte,Lexington,Qatira,Vegas
Brizzbin,Fennel,Libbertie,Quietstorm,Velicydee
Brookenzie,Flossianna,Libertyann,Quimby,Viktoriya
Burgandee,Freedom,Licet,Rainger,Wajd
Byainett,Goldie-Moon,Little Summers,Ravenbella,Wrandie
Byrkli,Graceland,Lixy,Rebel-Ann,Wyntre
Cabella,Gyzzelle,London,Remedy,Xerenity
Cachet,Hadies,Lotus,Remmington,Yafa
California,Hailo,Love,Remzije,Ynfiniti
Calloway,Happy,Lux,Reverie,Yochabelle
Calypso,Harlequin,Lybburtie,Rhuivnyin,Zepplyn
Capreece,Heaven,Magnolia,Russia,Zipaya
Cascade,Hella,Maitre,Saig,Zipporah
Cassiopeia,Hermyanie,Mali,Saylor,Zoigh
Catalina,Hero,Malybu,Sã-kõ-yã,Zuzu
Cedar,Heziachiah,Manhattan,Saoirse,
Cedee,Holland,Maplejo,Sapphire,
Celestial,Honesty,Mavity,Sativalyn,
Celtic,Icelynn,Mayte,Season,
Charm,Immaculata,McCall,Seattle,
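To show why these spellings complicate any name-based lookup, here is a crude Python sketch that maps respelled names onto a common key. The rewrite rules are hand-picked assumptions for this example, not an established phonetic algorithm, and they only catch a few of the patterns in the list above.

```python
import re

# Hypothetical, hand-picked respelling rules, for illustration only
RULES = [
    (r"ae", "e"),
    (r"y", "i"),
    (r"ck|k", "c"),
    (r"(.)\1+", r"\1"),  # collapse repeated letters
]

def crude_key(name: str) -> str:
    """Lower-case the name, drop non-letters, then apply each rule in order."""
    key = re.sub(r"[^a-z]", "", name.lower())
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key

print(crude_key("Aeryka") == crude_key("Erica"))   # True
print(crude_key("Jorja") == crude_key("Georgia"))  # False
```

The second comparison fails even though the names sound alike, which is the point: matching creative spellings back to common names takes far more than a handful of rules.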
Ray,
That is a great suggestion and a well-founded, scalable, and contemporary method for addressing missing values in a predictive model. The idea is that a decision tree will use patterns detected from *all* the variables - which may not be obvious to us, e.g. 2-way correlations - to predict the missing value for each observation.
Several other best practices for handling missing values include:
1. Simply leaving the missing values in the data and using a decision tree or an ensemble of decision trees (e.g. random forest and/or gradient boosting) as your final predictive model.
Decision trees handle missing values in at least two different ways:
--- In training they can group missing values in bins by themselves or along with other values of a variable, and use missing values to build the predictive model.
--- Surrogate rules: decision trees can use a variable like "State" to make a decision about a variable like "ZipCode" if it encounters a missing value for "ZipCode".
2. Impute the missing values however you like but retain a binary missing value indicator variable, so that missingness can be used to help make your final predictions.
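Point 2 can be sketched in a few lines of Python (not SAS). The median is only a stand-in for "however you like", and a tree-based imputation could be swapped in; the function name and sample values are assumptions for illustration.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None with the median of the observed values and flag what was filled."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # placeholder imputation rule
    imputed = [fill if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag

# Hypothetical birth years with two missing entries
birth_years = [1960, None, 1985, 1972, None]
imputed, missing_flag = impute_with_indicator(birth_years)
print(imputed)       # [1960, 1972, 1985, 1972, 1972]
print(missing_flag)  # [0, 1, 0, 0, 1]
```

Keeping `missing_flag` as its own input lets the final model pick up any pattern in which records are missing, which the imputed values alone would hide.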
Hope that helps.
You could use databases like this: