The Birthday Paradox

Trusted Advisor
Posts: 1,301

The birthday paradox states that in any group of 23 people there is roughly a 50% chance that at least two of them share a birthday. The BBC recently published an article showing that 16 of the 32 World Cup teams, each consisting of 23 players, have shared birthdays, thus demonstrating the paradox precisely. Today’s exercise asks you to recreate their calculation.

You can obtain the same listing of player birthdays that the BBC used from FIFA. Another source is the player rosters at Wikipedia.

Your task is to demonstrate that the 2014 World Cup rosters honor the birthday paradox.

SOURCE: Birthday Paradox | Programming Praxis

Trusted Advisor
Posts: 1,301

Re: The Birthday Paradox

Below is my solution. I encourage you to try solving it yourself before looking over what I have done. First, however, a couple of items I needed to address:

1. Apparently Wikipedia has some of the players' birthdays wrong, at least according to the BBC/FIFA data. I adjusted them to match with the following:

proc sql;

update foo

set bday='15MAR1984'd , mon=3 , day=15 , year=1984

where nat='Algeria' and name='Medhi Lacen';

update foo

set bday='23JUN1984'd , mon=6 , day=23 , year=1984

where nat='Chile' and name='José Manuel Rojas';

update foo

set bday='01JUN1984'd , mon=6 , day=1 , year=1984

where nat='Chile' and name='Jean Beausejour';

update foo

set bday='19FEB1991'd , mon=2 , day=19 , year=1991

where nat='Germany' and name='Christoph Kramer';

delete from foo

where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;

insert into foo

set nat='Croatia'

  , name='Ivan Mocinic'

  , mon=4

  , day=30

  , year=1993

  , bday='30APR1993'd

  , caps=0;

quit;

2. I found it easier to use the following Wikipedia link to collect my data: http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

filename u url 'http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

data foo;

format bday date9.;

length nat name $ 40;

retain one 1 nat no ;

infile u dlm='=|[](){}' recfm=s lrecl=1000000 column=col start=col;

if no=23 then call missing(nat);

if missing(nat) then do;

input @'\n===' nat : $32. @;

if countw(nat)>3 then stop;

input @'{{nat fs start}}' @;

end;

input @'no=' no : @'pos=' pos $2. @'name=' name :$40. @'age2' (_n_ _n_ _n_ _n_) (: ??) +(-22) @;

if missing(_n_) then input @'df=y' @'|' (year mon day) (:) @;

else input @'age2|' (_n_ _n_ _n_) (:) (year mon day) (:) +(-10) @;

bday=mdy(mon , day , year);

input @'caps=' caps : +5 club :$128. @'clubnat=' clubnat $3. @'\n' @@;

name=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,name));

club=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,club));

run;

proc sql;

/* Wikipedia birthdays appear to be incorrect; set to match BBC data */

update foo

set bday='15MAR1984'd , mon=3 , day=15 , year=1984

where nat='Algeria' and name='Medhi Lacen';

update foo

set bday='23JUN1984'd , mon=6 , day=23 , year=1984

where nat='Chile' and name='José Manuel Rojas';

update foo

set bday='01JUN1984'd , mon=6 , day=1 , year=1984

where nat='Chile' and name='Jean Beausejour';

update foo

set bday='19FEB1991'd , mon=2 , day=19 , year=1991

where nat='Germany' and name='Christoph Kramer';

delete from foo

where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;

insert into foo

set nat='Croatia'

  , name='Ivan Mocinic'

  , mon=4

  , day=30

  , year=1993

  , bday='30APR1993'd

  , caps=0;

quit;

proc means data=foo nway noprint;

class nat mon day;

var caps;

output out=n(where=(_freq_>1))

idgroup( max(caps) out[2](name bday)=)/autolabel autoname;

run;

*Some teams have multiple pairs, we choose one only;

data n;

set n;

by nat;

if first.nat;

run;

proc print data=n;

var nat name: bday:;

run;
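
To compare against the BBC's figure of 16 teams, the number of teams with at least one shared birthday can be counted straight from foo. This is only a sketch: it assumes foo has already been built and corrected as above, and the column alias teams_with_match is my own invention.

```sas
/* Sketch: count the teams in which some (mon, day) pair occurs more than once. */
/* Assumes the dataset foo from the program above, with columns nat, mon, day. */
proc sql;
  select count(distinct nat) as teams_with_match
    from (select nat, mon, day
            from foo
           group by nat, mon, day
          having count(*) > 1);
quit;
```

Unlike the PROC MEANS step, this reports only the count and does not list which players share the date.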

SAS Super FREQ
Posts: 3,755

Re: The Birthday Paradox

Sounds like a fun problem. If anyone wants to solve it by using SAS/IML, here are some basics on vectorizing the computations in the birthday paradox: Vectorized computations and the birthday matching problem - The DO Loop
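
As a rough illustration of what a vectorized version might look like (this is a sketch of my own, not code from the linked article, and the matrix names n, pNoMatch and pMatch are hypothetical), the match probability for every group size can be computed in one pass with a cumulative product. It assumes 365 equally likely birthdays.

```sas
proc iml;
   n = T(1:25);                              /* group sizes 1..25 as a column vector    */
   pNoMatch = cuprod((365 - n + 1) / 365);   /* P(all n birthdays are different)        */
   pMatch = 1 - pNoMatch;                    /* P(at least two people share a birthday) */
   print n pMatch[format=8.4];
quit;
```

The row for n=23 is the first one where pMatch exceeds 0.5.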

Super User
Posts: 5,516

Re: The Birthday Paradox

Hmmm...

SAS seems to agree that the cutoff is 23 people. However, the percentage comes out slightly different from the one in the article. Test program:

data _null_;

   prob = 1;

   do i=1 to 25;

      prob = prob * (365 - i) / 365;

      put i= prob=;

   end;

run;

In this program, i is the number of people being added to the starting group of 1 person.  So the group of 23 people is reached when i=22.  PROB is the probability that all people in the group have different birthdays.
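
If the off-by-one between i and the group size is confusing, a variant can index the loop by the number of people and print the chance of a match directly. This is just a sketch of the same calculation; the names n, p_distinct and p_match are mine.

```sas
data _null_;
   p_distinct = 1;                                     /* one person: no match possible   */
   do n = 2 to 25;                                     /* n = number of people in group   */
      p_distinct = p_distinct * (365 - (n - 1)) / 365; /* n-th person avoids n-1 dates    */
      p_match = 1 - p_distinct;                        /* at least one shared birthday    */
      put n= p_match= 8.4;
   end;
run;
```

The match probability is about 0.4757 for 22 people and about 0.5073 for 23, which is why 23 is the cutoff.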

Respected Advisor
Posts: 3,156

Re: The Birthday Paradox

Posted in reply to Astounding

Yes, perm() comes up with the same number. Here prob is the probability that at least two people share a birthday in a group of 23:

data _null_;

prob=1-perm(365,23)/365**23;

put prob=;

run;
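
The same number can be carried over to the World Cup claim. As a sketch, and assuming the 32 squads are independent and all 365 birthdays are equally likely (both simplifications), the count of squads with a shared birthday is binomial. The variable names expected and p16 are my own.

```sas
data _null_;
   p = 1 - perm(365, 23) / 365**23;    /* P(shared birthday) in one squad of 23       */
   expected = 32 * p;                  /* expected number of squads with a match      */
   p16 = pdf('BINOMIAL', 16, p, 32);   /* P(exactly 16 of the 32 squads have a match) */
   put p= expected= p16=;
run;
```

The expected count works out to about 16.2 squads, so 16 of 32 is right where the model puts it.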

Haikuo
