Re: SAS analysis of causation for rare events

markshanks · Posted 09-25-2015 08:46 PM

I'm familiar with basic statistical techniques like ANOVA, regression and trend forecasting. However, I want to use SAS to identify possible causal relations when there are numerous different types of rare categorical events occurring over time. For example, event Y may have occurred 5 times in the last 3 years, and every time it was preceded shortly beforehand by event X. There may be hundreds of these different events occurring, though, most of them just unrelated noise.

I'm guessing that this is a common statistical problem. Does SAS have procedures to investigate data like this?

PGStats · Posted 09-26-2015 11:48 PM

I don't know of any specialized SAS procedure to tackle this problem, but you could look at it with common tests. The idea is to compare the frequency of events occurring in a "causation window" just before rare events with their frequency elsewhere in the sequence. If a causing event exists, it should be detected as overly frequent inside "causation windows". I devised a small simulation to show the idea.

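/* Simulate a sequence made of 21 kinds of random events 
 occurring about every hour over a year. The TABLE distribution
 below lists 20 probabilities; the 21st kind gets the remainder.
 All event kinds occur frequently except for event = 1, which is
 quite rare: with probability 6E-4 per event, it should occur 
 only about 5 times per year on average. */ 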
data sequence;
call streaminit(8559);
time = '01JAN2001:00:00:00'dt;
do while ( time < '31DEC2001:23:00:00'dt);
    event = rand("TABLE",
        6E-4,0.05,0.05,0.05,0.05,
        0.05,0.05,0.05,0.05,0.05,
        0.05,0.05,0.05,0.05,0.05,
        0.05,0.05,0.05,0.05,0.05);
    time = time + '01:00:00't * rand("EXPONENTIAL");
    output;
    end;
run;

/* Define a time window within which causation is
 expected to occur. */
%let timeWindow=06:00:00; /* Six hours */

/* Determine which events occurred during the time window 
 prior to our rare events (event = 1) */
proc sort data=sequence; by descending time; run;

data sequenceW;
set sequence;
retain lastEventTime 1e40;
inWindow = time > lastEventTime - "&timeWindow."t;
if event = 1 then lastEventTime = time;
drop lastEventTime;
run;

/* Compare the frequencies of every event kind inside 
 and outside the causation time window. An event causing
 our rare event should stand out as being too frequent
 inside the time window.
 Look at individual cell Chi-Squares and at the overall 
 Likelihood Ratio Chi-Square Test */ 
proc freq data=sequenceW;
table event*inWindow / cellchi2;
exact lrchi / mc seed=86556;
run;

/* The simulation above represented the null hypothesis
 where no event was causing the rare event. */ 

/* Now, choose event = 2 as a causing event. This
 is simulated by adding extra event=2, two hours before 
 each rare event. Note that the original event = 2
 "noise" is left in the sequence. */
data sequencePlus;
set sequence sequence(where=(event=1) in=special);
if special then do;
    time = time - '02:00:00't;
    event = 2;
    end;
run;

/* Repeat the analysis */
proc sort data=sequencePlus; by descending time; run;

%let timeWindow=06:00:00;

data sequencePlusW;
set sequencePlus;
retain lastEventTime 1e40;
inWindow = time > lastEventTime - "&timeWindow."t;
if event = 1 then lastEventTime = time;
drop lastEventTime;
run;

proc freq data=sequencePlusW;
table event*inWindow / cellchi2;
exact lrchi / mc seed=86556;
run;

/* Now the presence of a causing event is shown by 
 the overall Likelihood Ratio Chi-Square Test and
 the identity of the causing event (event = 2) is made 
 clear by individual cell Chi-Squares */

PG

SteveDenham · Posted 09-28-2015 12:44 PM

I would be very careful about something like this, with such small N for the number of events. Otherwise, you end up saying things like: Pope Francis visiting the United States causes a total lunar eclipse. Big data and rare events can point out a concurrency of events that has absolutely nothing to do with causality.
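
To put a rough number on that risk, here is a minimal sketch of how likely it is that some unrelated event kind precedes every rare event purely by coincidence. Everything in it is an assumption made for illustration: noise events arrive as independent Poisson processes at 0.05 per hour (as in the simulation earlier in the thread), the window is six hours, windows do not overlap, and there are 5 rare events. The data set and variable names are hypothetical.

```sas
/* Sketch: chance that at least one unrelated event kind falls in
   the causation window before EVERY rare event. All inputs below
   are assumed values, not estimates from real data. */
data coincidence;
    rate   = 0.05;  /* assumed occurrences per hour of one noise kind */
    window = 6;     /* causation window, in hours */
    nRare  = 5;     /* number of rare events observed */

    /* P(a given noise kind occurs at least once in one window) */
    q = 1 - exp(-rate * window);

    /* P(that kind occurs in the window before all nRare events) */
    pAll = q ** nRare;

    do nKinds = 20, 100, 500;
        /* P(at least one of nKinds independent kinds does so) */
        pAny = 1 - (1 - pAll) ** nKinds;
        output;
    end;
run;

proc print data=coincidence noobs;
    var nKinds q pAll pAny;
    format q pAll pAny 8.4;
run;
```

Under these assumptions the chance of a perfect but meaningless "predictor" is about 2% with 20 event kinds and over 40% with 500, which is why a match on 5 out of 5 occurrences is weak evidence when hundreds of event kinds are being screened.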

Currently, I don't believe any of the SAS procs really address statistical causality, even though a lot of us statistician types treat the results as if causality had been demonstrated. The closest I would come to a causality argument would be to use Bayesian methods, such as in PROC MCMC, with a very strict and informative prior.
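
As one possible reading of that suggestion, here is a minimal PROC MCMC sketch with a skeptical, informative prior. It is not a causal analysis, only a way to see how much a strong prior tempers a small sample. The counts are hypothetical (5 rare events, all 5 preceded by event X), the background chance of seeing X in a window is assumed to be 0.26, and the prior is a beta distribution centered on that background value with the weight of 20 pseudo-observations. The data set names `counts`, `post` and `postFlag` are made up for the example.

```sas
/* Hypothetical counts: n rare events, x of them preceded by event X */
data counts;
    n = 5;
    x = 5;
run;

/* p = probability that event X falls in the window before a rare event.
   Prior: beta(5.2, 14.8), i.e. mean 0.26 (the assumed background chance)
   with a prior weight of 20 pseudo-observations. Both are assumptions. */
proc mcmc data=counts seed=1234 nbi=2000 nmc=20000 outpost=post;
    parms p 0.26;
    prior p ~ beta(5.2, 14.8);
    model x ~ binomial(n, p);
run;

/* Posterior probability that p exceeds the assumed background chance */
data postFlag;
    set post;
    aboveBackground = (p > 0.26);
run;

proc means data=postFlag mean;
    var aboveBackground;
run;
```

A stricter prior (more pseudo-observations) demands more occurrences before the posterior moves away from the background rate; the choice of prior weight is a judgment call that should be made before looking at the data.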

Steve Denham

PGStats · Posted 09-28-2015 02:47 PM

Very true, Steve! I think the demonstration of causality actually requires the identification of a plausible mechanism. Otherwise, it's only **bleep** statistics. <-- The **bleep** was inserted by the text editor; talk about freedom of speech!

PG