markshanks
Calcite | Level 5

I'm familiar with basic statistical techniques like ANOVA, regression, and trend forecasting. However, I want to use SAS to identify possible causal relations when numerous different types of rare categorical events occur over time. For example, event Y may have occurred 5 times in the last 3 years, and every time it was immediately preceded by event X. There may be hundreds of these different event types, though, most of them just unrelated noise.

 

I'm guessing that this is a common statistical problem and SAS has procedures to investigate data like this?

3 REPLIES
PGStats
Opal | Level 21

I don't know of any specialized SAS procedure to tackle this problem, but you could look at it with common tests. The idea is to compare the frequency of events occurring in a "causation window" just before each rare event with their frequency elsewhere in the sequence. If a causing event exists, it should be detected as overly frequent inside the "causation windows". I devised a small simulation to show the idea.

 

/* Simulate a sequence made of 21 kinds of random events 
 occurring about every hour over a year. All event kinds 
 occur frequently except for event = 1 which is quite rare; 
 event = 1 should occur only about 5 times on average per year
 (probability 6E-4 times roughly 8760 events). */ 
data sequence;
call streaminit(8559);
time = '01JAN2001:00:00:00'dt;
do while ( time < '31DEC2001:23:00:00'dt);
    event = rand("TABLE",
        6E-4,0.05,0.05,0.05,0.05,
        0.05,0.05,0.05,0.05,0.05,
        0.05,0.05,0.05,0.05,0.05,
        0.05,0.05,0.05,0.05,0.05);
    time = time + '01:00:00't * rand("EXPONENTIAL");
    output;
    end;
run;

/* Define a time window within which causation is
 expected to occur. */
%let timeWindow=06:00:00; /* Six hours */

/* Determine which events occurred during the time window 
 prior to our rare events (event = 1) */
proc sort data=sequence; by descending time; run;

data sequenceW;
set sequence;
retain lastEventTime 1e40;
inWindow = time > lastEventTime - "&timeWindow."t;
if event = 1 then lastEventTime = time;
drop lastEventTime;
run;

/* Compare the frequencies of every event kind inside 
 and outside the causation time window. An event causing
 our rare event should stand out as being too frequent
 inside the time window.
 Look at individual cell Chi-Squares and at the overall 
 Likelihood Ratio Chi-Square Test */ 
proc freq data=sequenceW;
table event*inWindow / cellchi2;
exact lrchi / mc seed=86556;
run;

/* The simulation above represented the null hypothesis
 where no event was causing the rare event. */ 

/* Now, choose event = 2 as a causing event. This
 is simulated by adding extra event=2, two hours before 
 each rare event. Note that the original event = 2
 "noise" is left in the sequence. */
data sequencePlus;
set sequence sequence(where=(event=1) in=special);
if special then do;
    time = time - '02:00:00't;
    event = 2;
    end;
run;

/* Repeat the analysis */
proc sort data=sequencePlus; by descending time; run;

%let timeWindow=06:00:00;

data sequencePlusW;
set sequencePlus;
retain lastEventTime 1e40;
inWindow = time > lastEventTime - "&timeWindow."t;
if event = 1 then lastEventTime = time;
drop lastEventTime;
run;

proc freq data=sequencePlusW;
table event*inWindow / cellchi2;
exact lrchi / mc seed=86556;
run;

/* Now the presence of a causing event is shown by 
 the overall Likelihood Ratio Chi-Square Test and
 the identity of the causing event (event = 2) is made 
 clear by individual cell Chi-Squares */
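
With hundreds of event kinds, reading individual cell Chi-Squares by eye becomes impractical. One possible follow-up, sketched here only under the assumptions of the simulation above (dataset sequencePlusW, event kinds numbered 2 to 21), is to run one 2x2 Fisher exact test per candidate kind and then adjust the p-values for multiple testing. The names eventFlags, kind, isKind, fisherP and fisherAdj are made up for this illustration.

```sas
/* Build one 2x2 layout per candidate kind: isKind x inWindow.
   The rare event itself (event = 1) is left out. */
data eventFlags;
set sequencePlusW(where=(event ne 1));
do kind = 2 to 21;
    isKind = (event = kind);   /* 1 if this row is the candidate kind */
    output;
    end;
run;

proc sort data=eventFlags; by kind; run;

/* The OUTPUT statement stores Fisher's exact test results;
   XP2_FISH is the two-sided p-value. */
proc freq data=eventFlags noprint;
by kind;
table isKind*inWindow / fisher;
output out=fisherP(keep=kind xp2_fish) fisher;
run;

/* Adjust the 20 raw p-values (Holm and false discovery rate) */
proc multtest inpvalues(xp2_fish)=fisherP holm fdr out=fisherAdj;
run;

proc print data=fisherAdj; run;
```

The two-sided p-value is used for simplicity; it flags kinds that are either too frequent or too rare inside the window, so the direction still has to be checked in the frequency table. An adjusted p-value that stays small only points to an association worth investigating, not to causation.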
PG
SteveDenham
Jade | Level 19

I would be very careful about something like this, with such small N for the number of events.  Otherwise, you end up saying things like: Pope Francis visiting the United States causes a total lunar eclipse.  Big data and rare events can point out a concurrency of events that has absolutely nothing to do with causality.

 

Currently, I don't believe any of the SAS procs really address statistical causality, even though a lot of us statistician types treat the results as if causality was demonstrated.  The closest I would come to a causality argument would be to use Bayesian methods, such as in PROC MCMC, with a very strict and informative prior.
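
As a rough sketch of what such a Bayesian analysis could look like (an illustration only, not a recommended model), the counts of one candidate event inside and outside the causation window can be given a binomial model in PROC MCMC, with a tight prior centred on "no effect" for the log odds ratio. It assumes the sequencePlusW dataset from the simulation above and event = 2 as the candidate; the prior standard deviation of 0.5, the seed, and the names counts, pOut, logOR and post are arbitrary choices made for this example.

```sas
/* Count the candidate event (event = 2) among all non-rare events,
   separately outside (inWindow = 0) and inside (inWindow = 1) the window */
proc sql;
create table counts as
select inWindow, count(*) as n, sum(event = 2) as x
from sequencePlusW
where event ne 1
group by inWindow;
quit;

proc mcmc data=counts seed=20155 nmc=20000 outpost=post;
parms pOut 0.05;                    /* frequency outside the window */
parms logOR 0;                      /* log odds ratio, inside vs outside */
prior pOut ~ beta(1, 1);
prior logOR ~ normal(0, sd=0.5);    /* skeptical prior: little effect expected */
eta = log(pOut / (1 - pOut)) + logOR * inWindow;
p = 1 / (1 + exp(-eta));
model x ~ binomial(n, p);
run;
```

If the posterior for logOR stays clearly above zero despite the skeptical prior, the data are overriding a strong presumption of no association. That is still evidence of association only.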

 

Steve Denham

PGStats
Opal | Level 21

Very true, Steve! I think the demonstration of causality actually requires the identification of a plausible mechanism. Otherwise, it's only **bleep** statistics. :)  <-- The **bleep** was inserted by the text editor; talk about freedom of speech!

PG
