FriedEgg
SAS Employee

The birthday paradox states that in a group of 23 people there is roughly a 50% chance that at least two of them share a birthday. The BBC recently published an article showing that 16 of the 32 World Cup teams, each consisting of 23 players, have at least one pair of players with a shared birthday, which matches the prediction almost exactly. Today’s exercise asks you to recreate their calculation.

You can obtain the same listing of player birthdays that the BBC used from FIFA. Another source is the player rosters on Wikipedia.

Your task is to demonstrate that the 2014 World Cup rosters honor the birthday paradox.

SOURCE: Birthday Paradox | Programming Praxis

FriedEgg
SAS Employee

Below is my solution. I encourage you to attempt the problem yourself before looking over what I have done. However, I need to point out a couple of items I had to address:

1. Apparently Wikipedia has some of the players' birthdays incorrect, at least according to the BBC/FIFA data. I adjusted them to match with the following:

proc sql;
update foo
  set bday='15MAR1984'd , mon=3 , day=15 , year=1984
  where nat='Algeria' and name='Medhi Lacen';
update foo
  set bday='23JUN1984'd , mon=6 , day=23 , year=1984
  where nat='Chile' and name='José Manuel Rojas';
update foo
  set bday='01JUN1984'd , mon=6 , day=1 , year=1984
  where nat='Chile' and name='Jean Beausejour';
update foo
  set bday='19FEB1991'd , mon=2 , day=19 , year=1991
  where nat='Germany' and name='Christoph Kramer';
delete from foo
  where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;
insert into foo
  set nat='Croatia'
  , name='Ivan Mocinic'
  , mon=4
  , day=30
  , year=1993
  , bday='30APR1993'd
  , caps=0;
quit;

2. I found it easier to use the following Wikipedia link to collect my data: http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

filename u url 'http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

data foo;
  format bday date9.;
  length nat name $ 40;
  retain one 1 nat no ;
  infile u dlm='=|[](){}' recfm=s lrecl=1000000 column=col start=col;
  if no=23 then call missing(nat);
  if missing(nat) then do;
    input @'\n===' nat : $32. @;
    if countw(nat)>3 then stop;
    input @'{{nat fs start}}' @;
  end;
  input @'no=' no : @'pos=' pos $2. @'name=' name :$40. @'age2' (_n_ _n_ _n_ _n_) (: ??) +(-22) @;
  if missing(_n_) then input @'df=y' @'|' (year mon day) (:) @;
  else input @'age2|' (_n_ _n_ _n_) (:) (year mon day) (:) +(-10) @;
  bday=mdy(mon , day , year);
  input @'caps=' caps : +5 club :$128. @'clubnat=' clubnat $3. @'\n' @@;
  name=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,name));
  club=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,club));
run;

proc sql;
/*Wikipedia bday's incorrect? set to match BBC data */
update foo
  set bday='15MAR1984'd , mon=3 , day=15 , year=1984
  where nat='Algeria' and name='Medhi Lacen';
update foo
  set bday='23JUN1984'd , mon=6 , day=23 , year=1984
  where nat='Chile' and name='José Manuel Rojas';
update foo
  set bday='01JUN1984'd , mon=6 , day=1 , year=1984
  where nat='Chile' and name='Jean Beausejour';
update foo
  set bday='19FEB1991'd , mon=2 , day=19 , year=1991
  where nat='Germany' and name='Christoph Kramer';
delete from foo
  where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;
insert into foo
  set nat='Croatia'
  , name='Ivan Mocinic'
  , mon=4
  , day=30
  , year=1993
  , bday='30APR1993'd
  , caps=0;
quit;

proc means data=foo nway noprint;
  class nat mon day;
  var caps;
  output out=n(where=(_freq_>1))
    idgroup( max(caps) out[2](name bday)=)/autolabel autoname;
run;

*Some teams have multiple pairs, we choose one only;

data n;
  set n;
  by nat;
  if first.nat;
run;

proc print data=n;
  var nat name: bday:;
run;
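As a possible follow-up check (not part of the solution above), the number of teams with at least one shared birthday could be counted directly from foo. This is only a sketch: it assumes foo holds one row per player with the nat, mon, and day columns built above, and the alias t and the column name teams_with_shared_bday are made up for illustration.

```sas
proc sql;
  /* Inner query: teams having two or more players with the same month and day. */
  /* Outer query: count those teams (the BBC article reports 16 of 32).          */
  select count(*) as teams_with_shared_bday
  from (select distinct nat
        from foo
        group by nat, mon, day
        having count(*) > 1) as t;
quit;
```

Grouping by month and day rather than by bday ignores the birth year, which is what the birthday paradox is about.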

Rick_SAS
SAS Super FREQ

Sounds like a fun problem. If anyone wants to solve it by using SAS/IML, here are some basics on vectorizing the computations in the birthday paradox: Vectorized computations and the birthday matching problem - The DO Loop

Astounding
PROC Star

Hmmm...

SAS seems to agree that the cutoff is 23 people. However, the percentage comes out slightly different from the one in the article. Test program:

data _null_;
   prob = 1;
   do i=1 to 25;
      prob = prob * (365 - i) / 365;
      put i= prob=;
   end;
run;

In this program, i is the number of people being added to the starting group of 1 person.  So the group of 23 people is reached when i=22.  PROB is the probability that all people in the group have different birthdays.
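For reference, the loop above evaluates the standard product formula. The sketch below assumes 365 equally likely birthdays and ignores leap years, which are the same assumptions the program makes; n denotes the group size, so n = i + 1.

```latex
\[
P(\text{all distinct among } n) = \prod_{k=1}^{n-1} \frac{365-k}{365},
\qquad
P(\text{shared among } n) = 1 - \prod_{k=1}^{n-1} \frac{365-k}{365}.
\]
\[
n = 22:\quad 1 - \prod_{k=1}^{21} \frac{365-k}{365} \approx 0.4757,
\qquad
n = 23:\quad 1 - \prod_{k=1}^{22} \frac{365-k}{365} \approx 1 - 0.4927 = 0.5073 .
\]
```

So 23 is the smallest group for which a shared birthday is more likely than not, at about 50.7% rather than exactly 50%. Across 32 squads of 23 players that predicts roughly 32 × 0.5073 ≈ 16.2 teams with a match, in line with the 16 reported.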

Haikuo
Onyx | Level 15

Yes, perm() comes up with the same number. Here prob is the probability that at least 2 people share a birthday in a group of 23:

data _null_;
   prob=1-perm(365,23)/365**23;
   put prob=;
run;

Haikuo
