The Birthday Paradox

06-28-2014 01:37 AM

The birthday paradox states that in any group of 23 people there is about a 50% chance that at least two of them share a birthday. The BBC recently published an article showing that 16 of the 32 World Cup teams, each consisting of 23 players, have shared birthdays, thus demonstrating the paradox precisely. Today’s exercise asks you to recreate their calculation.

You can obtain the same listing of player birthdays that the BBC used from FIFA. Another source is the player rosters at Wikipedia.

Your task is to demonstrate that the 2014 World Cup rosters honor the birthday paradox.
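
For reference, the 50% figure can be worked out as follows. This is a sketch that assumes 365 equally likely birthdays, no leap day, and players whose birthdays are independent of each other; n is the group size.

```latex
P(\text{all } n \text{ birthdays distinct})
  = \prod_{k=1}^{n-1} \frac{365-k}{365}
  = \frac{365!}{(365-n)!\;365^{n}}

P(\text{at least one shared birthday}) = 1 - \prod_{k=1}^{n-1} \frac{365-k}{365}

n = 23:\quad 1 - 0.4927 \approx 0.5073

\text{Expected teams with a shared birthday among 32 squads} \approx 32 \times 0.5073 \approx 16.2
```

Under those assumptions, seeing 16 of 32 squads with a shared birthday is very close to the expected count.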

06-28-2014 01:45 AM

Below is my solution. I encourage you to attempt the problem yourself before looking over what I have done. However, I need to point out two items I had to address:

1. Apparently Wikipedia has some of the players' birthdays incorrect, at least according to the BBC/FIFA data. I adjusted them to match with the following:

proc sql;

update foo

set bday='15MAR1984'd , mon=3 , day=15 , year=1984

where nat='Algeria' and name='Medhi Lacen';

update foo

set bday='23JUN1984'd , mon=6 , day=23 , year=1984

where nat='Chile' and name='José Manuel Rojas';

update foo

set bday='01JUN1984'd , mon=6 , day=1 , year=1984

where nat='Chile' and name='Jean Beausejour';

update foo

set bday='19FEB1991'd , mon=2 , day=19 , year=1991

where nat='Germany' and name='Christoph Kramer';

delete from foo

where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;

insert into foo

set nat='Croatia'

, name='Ivan Mocinic'

, mon=4

, day=30

, year=1993

, bday='30APR1993'd

, caps=0;

quit;

2. I found it easier to use the following Wikipedia link to collect my data: http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

filename u url 'http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

data foo;

format bday date9.;

length nat name $ 40;

retain one 1 nat no ;

infile u dlm='=|[](){}' recfm=s lrecl=1000000 column=col start=col;

if no=23 then call missing(nat);

if missing(nat) then do;

input @'\n===' nat : $32. @;

if countw(nat)>3 then stop;

input @'{{nat fs start}}' @;

end;

input @'no=' no : @'pos=' pos $2. @'name=' name :$40. @'age2' (_n_ _n_ _n_ _n_) (: ??) +(-22) @;

if missing(_n_) then input @'df=y' @'|' (year mon day) (:) @;

else input @'age2|' (_n_ _n_ _n_) (:) (year mon day) (:) +(-10) @;

bday=mdy(mon , day , year);

input @'caps=' caps : +5 club :$128. @'clubnat=' clubnat $3. @'\n' @@;

name=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,name));

club=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,club));

run;

proc sql;

/*Wikipedia bday's incorrect? set to match BBC data */

update foo

set bday='15MAR1984'd , mon=3 , day=15 , year=1984

where nat='Algeria' and name='Medhi Lacen';

update foo

set bday='23JUN1984'd , mon=6 , day=23 , year=1984

where nat='Chile' and name='José Manuel Rojas';

update foo

set bday='01JUN1984'd , mon=6 , day=1 , year=1984

where nat='Chile' and name='Jean Beausejour';

update foo

set bday='19FEB1991'd , mon=2 , day=19 , year=1991

where nat='Germany' and name='Christoph Kramer';

delete from foo

where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;

insert into foo

set nat='Croatia'

, name='Ivan Mocinic'

, mon=4

, day=30

, year=1993

, bday='30APR1993'd

, caps=0;

quit;

proc means data=foo nway noprint;

class nat mon day;

var caps;

output out=n(where=(_freq_>1))

idgroup( max(caps) out[2](name bday)=)/autolabel autoname;

run;

*Some teams have multiple pairs, we choose one only;

data n;

set n;

by nat;

if first.nat;

run;

proc print data=n;

var nat name: bday:;

run;
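
As a quick cross-check of the PROC MEANS result, the number of teams with at least one shared birthday could also be counted directly in PROC SQL. This is only a sketch: it assumes foo still has the nat, mon and day columns built above, and teams_with_shared_bday is a column name made up for illustration.

```sas
proc sql;
  /* Inner query: one row per (team, month, day) that occurs more than once. */
  /* Outer query: count each team once, however many pairs it has.           */
  select count(distinct nat) as teams_with_shared_bday
    from (select nat, mon, day
            from foo
           group by nat, mon, day
          having count(*) > 1);
quit;
```

If the data matches the BBC listing, the count should agree with the 16 teams reported in the article.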

06-30-2014 12:48 PM

Sounds like a fun problem. If anyone wants to solve it by using SAS/IML, here are some basics on vectorizing the computations in the birthday paradox: Vectorized computations and the birthday matching problem - The DO Loop

06-30-2014 02:43 PM

Hmmm...

SAS seems to agree that the cutoff is 23 people. However, the percentage comes out slightly different from the one in the article. Test program:

data _null_;

prob = 1;

do i=1 to 25;

prob = prob * (365 - i) / 365;

put i= prob=;

end;

run;

In this program, i is the number of people being added to the starting group of 1 person. So the group of 23 people is reached when i=22. PROB is the probability that all people in the group have different birthdays.
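
To avoid the off-by-one reading, the same loop could be indexed by total group size and print the probability of a shared birthday as well. This is a sketch of that variant, not a change to the program above; n, p_distinct and p_shared are names chosen here for illustration.

```sas
data _null_;
  p_distinct = 1;                      /* a group of 1 cannot share a birthday   */
  do n = 2 to 25;                      /* n is the total number of people        */
    /* The n-th person must avoid the n-1 birthdays already taken. */
    p_distinct = p_distinct * (365 - (n - 1)) / 365;
    p_shared   = 1 - p_distinct;       /* at least two people share a birthday   */
    put n= p_distinct= 6.4 p_shared= 6.4;
  end;
run;
```

With this indexing, p_shared first exceeds 0.5 at n=23, where it is about 0.5073; at n=22 it is about 0.4757.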

06-30-2014 03:11 PM

Yes, perm() comes up with the same number. Here prob is the probability that at least two people share a birthday in a group of 23:

data _null_;

prob=1-perm(365,23)/365**23;

put prob=;

run;

Haikuo
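
One caveat with the perm() form: for much larger groups, perm(365,n) and 365**n can both grow beyond what a double can hold. A possible workaround is to do the same calculation on the log scale with lgamma(). This is a sketch under the same 365-day assumption; the group sizes in the loop are arbitrary examples and must stay at or below 365.

```sas
data _null_;
  do n = 23, 50, 100, 200;
    /* log(perm(365,n)) = log(365!) - log((365-n)!) = lgamma(366) - lgamma(366-n) */
    log_p_distinct = lgamma(366) - lgamma(366 - n) - n * log(365);
    p_shared = 1 - exp(log_p_distinct);
    put n= p_shared=;
  end;
run;
```

For n=23 this should reproduce the value from the perm() version, about 0.5073.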