Kanyange
Fluorite | Level 6

Hi All

The date of birth in our current database is 25% missing. I would like to predict the missing values using first name etc. I heard it gives a good prediction?

Could anyone help with this? How is it done in SAS? What modelling technique do we use here?

Your help would be much appreciated

Many Thanks

7 REPLIES
RW9
Diamond | Level 26

Never heard of such a thing. I see there is an R package for guessing gender from a name, but I couldn't find anything on age, and I don't see how it would work anyway. Why not just assign them a random age within certain group ranges if you have to have an age, or infer it from other data? For example, they had an "xyz procedure" on a given date, so they would be at least 18 at that point, or they got a credit card on a given date, which indicates they were at least 18 at that point.
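The two fallbacks above can be sketched in a few lines. This is only an illustration in Python, not SAS, and the function names, the minimum age of 18, and the example dates are all assumptions made for the sketch: an event that requires a minimum age gives the latest possible date of birth, and a random date can then be drawn inside whatever range is considered plausible.

```python
from datetime import date, timedelta
import random


def latest_birth_date(event_date: date, min_age_years: int = 18) -> date:
    """Latest date of birth for someone at least min_age_years old on event_date."""
    try:
        return event_date.replace(year=event_date.year - min_age_years)
    except ValueError:
        # event_date is 29 Feb and the target year is not a leap year
        return event_date.replace(year=event_date.year - min_age_years, day=28)


def random_birth_date(earliest: date, latest: date, rng: random.Random) -> date:
    """Draw a date uniformly between earliest and latest (both inclusive)."""
    span_days = (latest - earliest).days
    return earliest + timedelta(days=rng.randint(0, span_days))


# Hypothetical record: a credit card opened on 30 June 2015.
upper_bound = latest_birth_date(date(2015, 6, 30))
# Hypothetical lower bound chosen by the analyst.
lower_bound = date(1940, 1, 1)
guess = random_birth_date(lower_bound, upper_bound, random.Random(0))
```

A randomly drawn date is a placeholder, not a prediction, so it is worth flagging such records as imputed.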

rayIII
SAS Employee

Are you using Enterprise Miner? If so, you can use the Impute node. Choose the Tree option:

    • Tree — Use the Tree setting to replace missing interval variable values with replacement values that are estimated by analyzing each input as a target. The remaining input and rejected variables are used as predictors. Use the Variables window to edit the status of the input variables. Variables that have a model role of target cannot be used to impute the data. Because the imputed value for each input variable is based on the other input variables, this imputation technique may be more accurate than simply using the variable mean or median to replace the missing tree values.          
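To make the idea concrete, here is a minimal sketch in Python of imputing one variable from another with a one-split regression tree (a "stump"). It is not the Enterprise Miner implementation, which is not shown in this thread; the column names `age` and `tenure_years` and the data are hypothetical. The tree is fitted on the rows where the target is known, and each missing value is replaced by the mean of the leaf it falls into.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Find the single split on x that minimises squared error in y.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]


def impute_with_stump(rows, target, predictor):
    """Return copies of rows with missing target values filled by the stump."""
    known = [r for r in rows if r[target] is not None]
    threshold, left_mean, right_mean = fit_stump(
        [r[predictor] for r in known], [r[target] for r in known]
    )
    filled = []
    for r in rows:
        r = dict(r)
        if r[target] is None:
            r[target] = left_mean if r[predictor] <= threshold else right_mean
        filled.append(r)
    return filled


# Hypothetical data: age is missing for the last two customers.
rows = [
    {"tenure_years": 1, "age": 22}, {"tenure_years": 2, "age": 25},
    {"tenure_years": 3, "age": 24}, {"tenure_years": 20, "age": 50},
    {"tenure_years": 22, "age": 55}, {"tenure_years": 25, "age": 52},
    {"tenure_years": 2, "age": None}, {"tenure_years": 30, "age": None},
]
filled = impute_with_stump(rows, target="age", predictor="tenure_years")
```

A real tree would consider every available input and split more than once; the stump only shows the mechanism.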

Hope this helps!

Ray

PaigeMiller
Diamond | Level 26

One of the frustrations of a question like this is that people will suggest methods to do this without stopping to mention that the idea of predicting a birth date based on first name seems to make no sense at all.

For any prediction method to work, there must be some sort of "correlation" between the input and the output. Maybe there is some correlation that I don't know about, but at this time, I would advise the original poster to not do this at all. The original poster did say "I would like to predict the missing values using first name etc. I heard it gives a good prediction?" but unless you can give us a reference, I think you're wasting your time.
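One way to check for such a relationship before building anything is to measure it on the 75% of records where the date of birth is known. The sketch below, in Python rather than SAS, computes the correlation ratio (eta squared): the share of the variance in birth year that is accounted for by grouping records by first name. The function name and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def correlation_ratio(categories, values):
    """Eta squared: between-group variance as a share of total variance."""
    overall = mean(values)
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    total = sum((v - overall) ** 2 for v in values)
    return between / total if total else 0.0


# Toy data: here the name separates the birth years perfectly.
strong = correlation_ratio(["A", "A", "B", "B"], [1950, 1950, 1990, 1990])
# Toy data: here the name carries no information about birth year.
none = correlation_ratio(["A", "B", "A", "B"], [1950, 1990, 1990, 1950])
```

A value near 0 means the name tells you almost nothing. Note that names which occur only once inflate the figure, because a group of one always matches its own mean, so the check is more trustworthy when computed on held-out records or on names with enough occurrences.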

--
Paige Miller
dkb
Quartz | Level 8

There's a Wolfram Alpha topic that's relevant: https://www.wolframalpha.com/input/?i=name+William&lk=3 . The "Estimated Current Age Distribution" plot is enlightening - it bears out PaigeMiller's advice - the slight correlation that's visible is far too loose to justify using this idea.

ballardw
Super User

An added complication is that, at least in many areas of the United States, parents have been attempting to give children "unique" names. Below is a selected list of girls' names from a recent 10-year period. Careful reading will show that many of these names are more or less phonetically equivalent to relatively common names, for example Aeryka <=> Erica.

Acacia,Chili,Indigodawn,Mem' Ree,Secret-Destiny

Aeryka,Cinamin,Indyana,Memphis,Serendipity

Alaska,Clarity,Infiniti,Mesa,Shasta

Alastrionna,Clarixxa,Innocence,Miami,Shasta-Rain

Allyvia,Cloketta,Integrity,Mishalyn,Sha'uri

Alpine,Creedance,Isabow,Modiesty-Star,Shy

Ambrosia,Crimson,Itali,Monet,Sicily

Americus,Cymmetry,Ixzy,Montana,Silver

Amnesty,Cynica,Izrayelle,Mysticque,Sincere

Anakalia,Daiquiri,Jaazminh,Nature,Snow

Angelic,Daytona,Jetta,Nautica,Sonrisa

Aptisam,Dayzee,Jewleah,Navy,Soul

Aquilla,Dazsha,Jorja,Nirvana,Sparrow

Arbor,Denym,Jubilee,Normandie,Starlit

Arlington,Diamonique,Juniper,Northstar,Sublym

Ataree,Diligence,Jupiter,Noxx,Supernova

August Star,Divinity,Jurnee,Nutaliay Harmoney,Surreal

Auktober,Dorcas,Jynnjer,Octayvia,Symphony

Auroaramackay,Draekli,Kahlua,Olive,Tacheranai

Autumn Hunnie,Dream,Kalispell,Otila,Tanaquil

Autym,Dublin,Kanyon,Oyuky,Tehyanatane

Aveda,Dymand,Karizma,Pallas,Teighlor

Beatriz,Eboneyrose,Kaskade,Pandora,Tennessee

Beautifull,Ecstacy,Kazpyr,Payshence,Thyme

Belphoebe,Eeleceya,Kezzi,Peaches,Tottie

Berlyn,Elexious,Khlover,Pennilane Meadow,Tragen

Berthaalicia,England,Kiffin,Pepper,Tricity

Bicardi,Envy,Klowie,Perfect,Trulie

Blayde,Eos,Kozmo,Persephone,Trynadee

Blessin,Epiphany,Krickette,Phaedra,Tsunami

Blyss,Essence,Kronic,Poet,Tuesday-Rain

Boisen,Eternytie,Krymsun,Poppy,Tundra

Braenwynne,Fall,Kwincee,Prairie,Tyranny

Breeze,Fancee,Lala,Pranaleyadri,Ugonna

Bristol,Fashion,Lavender,Pranathi,Uneike

Brittanica,Fayble,Lectra,Promise,Utahnna

Brixx,Fayte,Lexington,Qatira,Vegas

Brizzbin,Fennel,Libbertie,Quietstorm,Velicydee

Brookenzie,Flossianna,Libertyann,Quimby,Viktoriya

Burgandee,Freedom,Licet,Rainger,Wajd

Byainett,Goldie-Moon,Little Summers,Ravenbella,Wrandie

Byrkli,Graceland,Lixy,Rebel-Ann,Wyntre

Cabella,Gyzzelle,London,Remedy,Xerenity

Cachet,Hadies,Lotus,Remmington,Yafa

California,Hailo,Love,Remzije,Ynfiniti

Calloway,Happy,Lux,Reverie,Yochabelle

Calypso,Harlequin,Lybburtie,Rhuivnyin,Zepplyn

Capreece,Heaven,Magnolia,Russia,Zipaya

Cascade,Hella,Maitre,Saig,Zipporah

Cassiopeia,Hermyanie,Mali,Saylor,Zoigh

Catalina,Hero,Malybu,Sã-kõ-yã,Zuzu

Cedar,Heziachiah,Manhattan,Saoirse,

Cedee,Holland,Maplejo,Sapphire,

Celestial,Honesty,Mavity,Sativalyn,

Celtic,Icelynn,Mayte,Season,

Charm,Immaculata,McCall,Seattle,

PatrickHall
Obsidian | Level 7

Ray,

That is a great suggestion and a well-founded, scalable, and contemporary method for addressing missing values in a predictive model. The idea is that a decision tree will use patterns detected from *all* the variables - which may not be obvious to us, e.g. 2-way correlations - to predict the missing value for each observation.

Several other best practices for handling missing values include:

1. Simply leaving the missing values in the data and using a decision tree or an ensemble of decision trees (i.e. random forest and/or gradient boosting) as your final predictive model.

Decision trees handle missing values at least 2 different ways:

--- In training they can group missing values in bins by themselves or along with other values of a variable, and use missing values to build the predictive model.

--- Surrogate rules: decision trees can use a variable like "State" to make a decision about a variable like "ZipCode" if it encounters a missing value for "ZipCode".

2. Impute the missing values however you like but retain a binary missing value indicator variable, so that missingness can be used to help make your final predictions.
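Point 2 can be sketched as follows. This is a minimal Python illustration, not SAS code; the median is just one possible fill value, and the column name `age` and the `_missing` suffix are hypothetical. The point is that the flag is recorded before the gap is filled, so a later model can still see which values were originally missing.

```python
from statistics import median


def impute_with_indicator(rows, column):
    """Fill missing values with the median and add a 0/1 missing flag."""
    known = [r[column] for r in rows if r[column] is not None]
    fill_value = median(known)
    result = []
    for r in rows:
        r = dict(r)
        # Record missingness first, then overwrite the gap.
        r[f"{column}_missing"] = 1 if r[column] is None else 0
        if r[column] is None:
            r[column] = fill_value
        result.append(r)
    return result


# Hypothetical records with one missing age.
records = [{"age": 30}, {"age": None}, {"age": 50}]
imputed = impute_with_indicator(records, "age")
```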

Hope that helps.

