Learner
Posts: 1

# SAS analysis of causation for rare events

I'm familiar with basic statistical techniques like ANOVA, regression and trend forecasting. However, I want to use SAS to identify possible causal relations when there are numerous different types of rare categorical events occurring over time. For example, event Y may have occurred 5 times in the last 3 years, and every time it was immediately preceded by event X. There may be hundreds of these different events occurring, though, most of them just noise and unrelated.

I'm guessing that this is a common statistical problem. Does SAS have procedures to investigate data like this?

Posts: 5,053

## Re: SAS analysis of causation for rare events

I don't know of any specialized SAS procedure to tackle this problem, but you could look at it with common tests. The idea is to compare how often each kind of event occurs inside a "causation window" just before the rare events with how often it occurs elsewhere in the sequence. If a causing event exists, it should show up as unusually frequent inside the causation windows. I devised a small simulation to show the idea.

``````
/* Simulate a sequence made of 21 kinds of random events
   occurring about every hour over a year. All event kinds
   occur frequently except for event = 1, which is quite rare:
   with about 8,760 events per year and a probability of 6E-4,
   event = 1 should occur only about 5 times per year on average. */
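data sequence;
  call streaminit(8559);
  time = '01JAN2001:00:00:00'dt;
  do while (time < '31DEC2001:23:00:00'dt);
    /* Twenty probabilities are listed and they sum to 0.9506;
       RAND("TABLE") returns 21 with the remaining probability. */
    event = rand("TABLE",
      6E-4,0.05,0.05,0.05,0.05,
      0.05,0.05,0.05,0.05,0.05,
      0.05,0.05,0.05,0.05,0.05,
      0.05,0.05,0.05,0.05,0.05);
    /* Exponential waiting times with a mean of one hour */
    time = time + '01:00:00't * rand("EXPONENTIAL");
    output;
  end;
run;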

/* Define a time window within which causation is
expected to occur. */
%let timeWindow=06:00:00; /* Six hours */

/* Determine which events occurred during the time window
   prior to our rare events (event = 1) */
proc sort data=sequence; by descending time; run;

/* Reading backwards in time, lastEventTime holds the time of
   the nearest rare event that follows the current record. */
data sequenceW;
  set sequence;
  retain lastEventTime 1e40;
  inWindow = time > lastEventTime - "&timeWindow."t;
  if event = 1 then lastEventTime = time;
  drop lastEventTime;
run;

/* Compare the frequencies of every event kind inside
and outside the causation time window. An event causing
our rare event should stand out as being too frequent
inside the time window.
Look at individual cell Chi-Squares and at the overall
Likelihood Ratio Chi-Square Test */
proc freq data=sequenceW;
  table event*inWindow / cellchi2;
  exact lrchi / mc seed=86556;
run;

/* The simulation above represented the null hypothesis
where no event was causing the rare event. */

/* Now, choose event = 2 as a causing event. This
is simulated by adding extra event=2, two hours before
each rare event. Note that the original event = 2
"noise" is left in the sequence. */
data sequencePlus;
  set sequence sequence(where=(event=1) in=special);
  if special then do;
    time = time - '02:00:00't;
    event = 2;
  end;
run;

/* Repeat the analysis */
proc sort data=sequencePlus; by descending time; run;

%let timeWindow=06:00:00;

data sequencePlusW;
  set sequencePlus;
  retain lastEventTime 1e40;
  inWindow = time > lastEventTime - "&timeWindow."t;
  if event = 1 then lastEventTime = time;
  drop lastEventTime;
run;

proc freq data=sequencePlusW;
  table event*inWindow / cellchi2;
  exact lrchi / mc seed=86556;
run;

/* Now the presence of a causing event is shown by
   the overall Likelihood Ratio Chi-Square Test and
   the identity of the causing event (event = 2) is made
   clear by individual cell Chi-Squares */
``````
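
Once the cell chi-squares point at a candidate, one possible follow-up is a 2x2 exact test of that single event kind against all the others. This is only a minimal sketch: it assumes the `sequencePlusW` data set from the simulation, and the data set name `candidate` and the variable `isCandidate` are made up for illustration.

```sas
/* Hypothetical follow-up: test one candidate cause (event = 2)
   against all other event kinds with a 2x2 exact test. */
data candidate;
  set sequencePlusW;
  where event ne 1;            /* leave out the rare events themselves */
  isCandidate = (event = 2);   /* 1 for the candidate kind, 0 otherwise */
run;

proc freq data=candidate;
  tables isCandidate*inWindow / fisher;
run;
```

Because the candidate was picked after looking at the same data, the resulting p-value is optimistic and should be read as descriptive rather than confirmatory.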
PG
Posts: 2,655

## Re: SAS analysis of causation for rare events

I would be very careful about something like this, with such small N for the number of events.  Otherwise, you end up saying things like: Pope Francis visiting the United States causes a total lunar eclipse.  Big data and rare events can point out a concurrency of events that has absolutely nothing to do with causality.
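
To put a rough number on that risk, here is a small sketch of how quickly false positives pile up when many unrelated event kinds are each screened at the same level. It assumes the tests are independent, which real event streams may not satisfy, and the variable names are only illustrative.

```sas
/* Chance of at least one false positive when m unrelated event
   kinds are each tested at level alpha (independence assumed). */
data _null_;
  alpha = 0.05;
  do m = 1, 20, 200;
    fwer = 1 - (1 - alpha)**m;   /* family-wise error rate */
    bonferroni = alpha / m;      /* per-test level that keeps FWER <= alpha */
    put m= fwer= bonferroni=;
  end;
run;
```

With 20 candidate event kinds the chance of at least one spurious "cause" is already about 64%, and with 200 it is above 99.9%, so some multiplicity adjustment is needed before reading anything into a single small p-value.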

Currently, I don't believe any of the SAS procs really address statistical causality, even though a lot of us statistician types treat the results as if causality had been demonstrated. The closest I would come to a causality argument would be to use Bayesian methods, such as in PROC MCMC, with a very strict and informative prior.
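
As a rough illustration of what that could look like, here is a minimal sketch and not a recommended model. It takes the situation from the original question (5 occurrences of Y, all 5 preceded by X) and assumes a skeptical Beta(2, 18) prior, with mean 0.1, on the probability `p` that a Y is preceded by X. The prior, the data set name `rare` and the variable names are all assumptions for illustration; in practice the prior would have to reflect how often X falls inside the window by chance.

```sas
/* Hypothetical data: n rare events, k of them preceded by event X */
data rare;
  n = 5;
  k = 5;
run;

proc mcmc data=rare seed=1234 nbi=2000 nmc=20000 outpost=post;
  parms p 0.1;                 /* starting value for the chain */
  prior p ~ beta(2, 18);       /* strict, skeptical prior (assumed) */
  model k ~ binomial(n, p);    /* k of n rare events preceded by X */
run;
```

This model is conjugate, so the posterior is Beta(7, 18) with mean 0.28: even a perfect 5-out-of-5 record moves a strict prior only part of the way, and it still measures association, not causation.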

Steve Denham
