## DATA Step, Macro, Functions and more

Posts: 1,318

The birthday paradox states that in a group of 23 people there is about a 50% chance that at least two of them share a birthday. The BBC recently published an article showing that 16 of the 32 World Cup teams, each consisting of 23 players, have at least one shared birthday, which matches the prediction almost exactly. Today’s exercise asks you to recreate their calculation.

You can obtain the same listing of player birthdays that the BBC used from FIFA. Another source is the player rosters at Wikipedia.

Posts: 1,318

My solution is below. I encourage you to attempt the problem yourself before looking at what I have done. First, though, I need to point out two items I had to address:

1. Apparently Wikipedia has some of the players' birthdays wrong, at least according to the BBC/FIFA data. I adjusted them to match with the following:

```
proc sql;
update foo
set bday='15MAR1984'd , mon=3 , day=15 , year=1984
where nat='Algeria' and name='Medhi Lacen';
update foo
set bday='23JUN1984'd , mon=6 , day=23 , year=1984
where nat='Chile' and name='José Manuel Rojas';
update foo
set bday='01JUN1984'd , mon=6 , day=1 , year=1984
where nat='Chile' and name='Jean Beausejour';
update foo
set bday='19FEB1991'd , mon=2 , day=19 , year=1991
where nat='Germany' and name='Christoph Kramer';
delete from foo
where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;
insert into foo
set nat='Croatia'
, name='Ivan Mocinic'
, mon=4
, day=30
, year=1993
, bday='30APR1993'd
, caps=0;
quit;
```

2. I found it easier to use the following Wikipedia link to collect my data: http://en.wikipedia.org/w/api.php?format=json&action=query&titles=2014_FIFA_World_Cup_squads&prop=re...

```
data foo;
format bday date9.;
length nat name $ 40;
retain one 1 nat no ;
infile u dlm='=|[](){}' recfm=s lrecl=1000000 column=col start=col;
if no=23 then call missing(nat);
if missing(nat) then do;
input @'\n===' nat : $32. @;
if countw(nat)>3 then stop;
input @'{{nat fs start}}' @;
end;
input @'no=' no : @'pos=' pos $2. @'name=' name :$40. @'age2' (_n_ _n_ _n_ _n_) (: ??) +(-22) @;
if missing(_n_) then input @'df=y' @'|' (year mon day) (:) @;
else input @'age2|' (_n_ _n_ _n_) (:) (year mon day) (:) +(-10) @;
bday=mdy(mon , day , year);
input @'caps=' caps : +5 club :$128. @'clubnat=' clubnat $3. @'\n' @@;
name=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,name));
club=htmldecode(prxchange('s/\\u(.{4})/&#x$1;/',-1,club));
run;

proc sql;
/*Wikipedia bday's incorrect? set to match BBC data */
update foo
set bday='15MAR1984'd , mon=3 , day=15 , year=1984
where nat='Algeria' and name='Medhi Lacen';
update foo
set bday='23JUN1984'd , mon=6 , day=23 , year=1984
where nat='Chile' and name='José Manuel Rojas';
update foo
set bday='01JUN1984'd , mon=6 , day=1 , year=1984
where nat='Chile' and name='Jean Beausejour';
update foo
set bday='19FEB1991'd , mon=2 , day=19 , year=1991
where nat='Germany' and name='Christoph Kramer';
delete from foo
where nat='Croatia' and name='Milan Badelj'; *out due to injury, replaced by Ivan Mocinic;
insert into foo
set nat='Croatia'
, name='Ivan Mocinic'
, mon=4
, day=30
, year=1993
, bday='30APR1993'd
, caps=0;
quit;

proc means data=foo nway noprint;
class nat mon day;
var caps;
output out=n(where=(_freq_>1))
idgroup( max(caps) out[2](name bday)=)/autolabel autoname;
run;

*Some teams have multiple pairs, we choose one only;
data n;
set n;
by nat;
if first.nat;
run;

proc print data=n;
var nat name: bday:;
run;
```

SAS Super FREQ
Posts: 4,242

Sounds like a fun problem. If anyone wants to solve it by using SAS/IML, here are some basics on vectorizing the computations in the birthday paradox: Vectorized computations and the birthday matching problem - The DO Loop

Super User
Posts: 6,781

Hmmm...

SAS seems to agree that the cutoff is 23 people. However, the percentage comes out slightly different from the one in the article. Test program:

```
data _null_;
prob = 1;
do i=1 to 25;
prob = prob * (365 - i) / 365;
put i= prob=;
end;
run;
```

In this program, i is the number of people being added to the starting group of 1 person.  So the group of 23 people is reached when i=22.  PROB is the probability that all people in the group have different birthdays.
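For reference, the loop can be written as a product. This is only a sketch: it assumes 365 equally likely birthdays and ignores leap years, as the program does, and the symbol $n$ for the group size is introduced here for illustration (it corresponds to `i + 1` in the loop).

```latex
\begin{align*}
P(\text{all } n \text{ birthdays differ}) &= \prod_{k=1}^{n-1} \frac{365-k}{365} \\
P(\text{at least one shared birthday}) &= 1 - \prod_{k=1}^{n-1} \frac{365-k}{365} \\
n = 22 \ (i = 21):&\quad 1 - 0.5243 \approx 0.4757 \\
n = 23 \ (i = 22):&\quad 1 - 0.4927 \approx 0.5073
\end{align*}
```

So 23 is the smallest group for which a shared birthday is more likely than not, and the chance is a little above 50% rather than exactly 50%.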

Posts: 3,167

Yes, perm() comes up with the same number. Here prob is the probability that at least two people share a birthday in a group of 23:

```
data _null_;
prob=1-perm(365,23)/365**23;
put prob=;
run;
```
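As a rough check, the PERM expression corresponds to the closed form below. The last line is an illustrative back-of-the-envelope estimate, not something taken from the BBC article: it assumes the 32 squads are independent and that birthdays are uniformly distributed over 365 days.

```latex
\begin{align*}
P(\text{shared birthday among } 23) &= 1 - \frac{365!}{(365-23)!\; 365^{23}} \approx 0.5073 \\
E[\text{squads with a shared birthday}] &\approx 32 \times 0.5073 \approx 16.2
\end{align*}
```

Under those assumptions, about 16 of the 32 squads would be expected to contain a shared birthday, which is in line with the count reported in the opening post.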

Haikuo
