I'm familiar with basic statistical techniques like ANOVA, regression and trend forecasting. However, I want to use SAS to identify possible causal relations when there are numerous different types of rare categorical events occurring over time. For example, event Y may have occurred 5 times in the last 3 years, and each time it was immediately preceded by event X. There may be hundreds of these different events occurring, though, most of them just unrelated noise.
I'm guessing that this is a common statistical problem. Does SAS have procedures to investigate data like this?
I don't know of any specialized SAS procedure to tackle this problem, but you could look at it with common tests. The idea is to compare the frequency of events occurring in a "causation window" just before rare events with their frequency elsewhere in the sequence. If a causing event exists, it should be detected as overly frequent inside the "causation windows". I devised a small simulation to show the idea.
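/* Simulate a sequence made of 21 kinds of random events
occurring about every hour over a year. All event kinds
occur frequently except for event = 1 which is quite rare;
event = 1 should occur only about 5 times on average per year
(roughly 8760 events x 6E-4).
Only 20 probabilities are listed below; RAND("TABLE") assigns
the remaining probability (0.0494) to event = 21. */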
data sequence;
call streaminit(8559);
time = '01JAN2001:00:00:00'dt;
do while ( time < '31DEC2001:23:00:00'dt);
event = rand("TABLE",
6E-4,0.05,0.05,0.05,0.05,
0.05,0.05,0.05,0.05,0.05,
0.05,0.05,0.05,0.05,0.05,
0.05,0.05,0.05,0.05,0.05);
time = time + '01:00:00't * rand("EXPONENTIAL");
output;
end;
run;
/* Define a time window within which causation is
expected to occur. */
%let timeWindow=06:00:00; /* Six hours */
/* Determine which events occurred during the time window
prior to our rare events (event = 1) */
proc sort data=sequence; by descending time; run;
data sequenceW;
set sequence;
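/* The data are read backwards in time, so lastEventTime holds the
time of the next rare event to come. The huge initial value keeps
every observation outside the window until a rare event is met. */
retain lastEventTime 1e40;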
inWindow = time > lastEventTime - "&timeWindow."t;
if event = 1 then lastEventTime = time;
drop lastEventTime;
run;
/* Compare the frequencies of every event kind inside
and outside the causation time window. An event causing
our rare event should stand out as being too frequent
inside the time window.
Look at individual cell Chi-Squares and at the overall
Likelihood Ratio Chi-Square Test */
proc freq data=sequenceW;
table event*inWindow / cellchi2;
exact lrchi / mc seed=86556;
run;
/* The simulation above represented the null hypothesis
where no event was causing the rare event. */
/* Now, choose event = 2 as a causing event. This
is simulated by adding extra event=2, two hours before
each rare event. Note that the original event = 2
"noise" is left in the sequence. */
data sequencePlus;
set sequence sequence(where=(event=1) in=special);
if special then do;
time = time - '02:00:00't;
event = 2;
end;
run;
/* Repeat the analysis */
proc sort data=sequencePlus; by descending time; run;
%let timeWindow=06:00:00;
data sequencePlusW;
set sequencePlus;
retain lastEventTime 1e40;
inWindow = time > lastEventTime - "&timeWindow."t;
if event = 1 then lastEventTime = time;
drop lastEventTime;
run;
proc freq data=sequencePlusW;
table event*inWindow / cellchi2;
exact lrchi / mc seed=86556;
run;
/* Now the presence of a causing event is shown by
the overall Likelihood Ratio Chi-Square Test and
the identity of the causing event (event = 2) is made
clear by individual cell Chi-Squares */
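With hundreds of event kinds the crosstab gets long, so it may help to pull the cell contributions into a data set and sort them. Here is a minimal sketch of that idea, assuming the sequencePlusW data set created above. The data set names cells and ranked and the variables cellChi2 and excess are made up for illustration, and the cell chi-square is recomputed by hand as (observed - expected)**2 / expected rather than taken from the printed output.

```sas
/* Save observed and expected counts for every event-by-window cell.
   SPARSE keeps zero-count cells in the OUT= data set and
   OUTEXPECT adds the EXPECTED variable. */
proc freq data=sequencePlusW noprint;
  table event*inWindow / sparse out=cells outexpect;
run;

/* Recompute each cell's chi-square contribution, keeping only
   the cells inside the causation window. */
data ranked;
  set cells;
  where inWindow = 1;
  if expected > 0 then cellChi2 = (count - expected)**2 / expected;
  excess = count - expected; /* positive = more frequent than expected */
run;

/* Largest contributions first */
proc sort data=ranked; by descending cellChi2; run;

proc print data=ranked(obs=5) noobs;
  var event count expected excess cellChi2;
run;
```

In the second simulation, event = 2 should come out at or near the top with a positive excess. A large cellChi2 with a negative excess only means the event kind is rarer than expected inside the window.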
I would be very careful about something like this, with such small N for the number of events. Otherwise, you end up saying things like: Pope Francis visiting the United States causes a total lunar eclipse. Big data and rare events can point out a concurrency of events that has absolutely nothing to do with causality.
Currently, I don't believe any of the SAS procs really address statistical causality, even though a lot of us statistician types treat the results as if causality was demonstrated. The closest I would come to a causality argument would be to use Bayesian methods, such as in PROC MCMC, with a very strict and informative prior.
Steve Denham
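One way to guard against the kind of spurious concurrence described above is to test every candidate event kind separately and then adjust the p-values for the number of tests. The following is only a sketch under assumptions: it reuses the sequencePlusW data set from the simulation earlier in the thread, the data set names long, fisherP and adjusted are hypothetical, and a two-sided Fisher's exact test is swapped in for the cell chi-squares. An adjusted p-value still says nothing about mechanism; it only limits how often pure noise gets flagged.

```sas
/* One row per (observation, candidate kind), with a 0/1 flag saying
   whether the observation is of that kind. Kind 1 is the rare
   event itself, so it is skipped. */
data long;
  set sequencePlusW;
  do kind = 2 to 21;
    isKind = (event = kind);
    output;
  end;
  keep kind isKind inWindow;
run;

proc sort data=long; by kind; run;

/* Fisher's exact test of isKind by inWindow, one 2x2 table per kind.
   The OUTPUT data set holds the two-sided p-value in XP2_FISH. */
proc freq data=long noprint;
  by kind;
  table isKind*inWindow / fisher;
  output out=fisherP fisher;
run;

/* Adjust the raw p-values for multiplicity:
   Holm (step-down Bonferroni) and false discovery rate. */
proc multtest inpvalues(XP2_FISH)=fisherP holm fdr out=adjusted;
run;

/* Smallest raw p-values first */
proc sort data=adjusted; by XP2_FISH; run;
proc print data=adjusted(obs=5) noobs; run;
```

With only a handful of rare events the windows contain few observations, so even a genuine precursor may not survive the adjustment.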
Very true Steve! I think the demonstration of causality actually requires the identification of a plausible mechanism. Otherwise, it's only **bleep** statistics. <-- The **bleep** was inserted by the text editor; talk about freedom of speech!