09-25-2015 08:46 PM
I'm familiar with basic statistical techniques like ANOVA, regression and trend forecasting. However, I want to use SAS to identify possible causal relations when there are numerous different types of rare categorical events occurring over time. In other words, event Y may have occurred 5 times in the last 3 years, and every time it was immediately preceded by event X. There may be hundreds of these different events occurring though, most of them just noise and unrelated.
I'm guessing that this is a common statistical problem. Does SAS have procedures to investigate data like this?
09-26-2015 11:48 PM
I don't know of any specialized SAS procedure to tackle this problem, but you could look at it with common tests. The idea is to compare the frequency of each kind of event inside a "causation window" just before the rare events with its frequency elsewhere in the sequence. If a causing event exists, it should be detected as overly frequent inside the "causation windows". I devised a small simulation to show the idea.
/* Simulate a sequence made of 21 kinds of random events occurring about
   every hour over a year. All event kinds occur frequently except for
   event = 1 which is quite rare; event = 1 should occur only about 5 times
   on average per year (6E-4 x roughly 8760 events).
   The 20 listed probabilities sum to 0.9506; RAND("TABLE") assigns the
   remaining probability to a 21st category. */
data sequence;
    call streaminit(8559);
    time = '01JAN2001:00:00:00'dt;
    do while (time < '31DEC2001:23:00:00'dt);
        event = rand("TABLE", 6E-4, 0.05, 0.05, 0.05, 0.05,
                              0.05, 0.05, 0.05, 0.05, 0.05,
                              0.05, 0.05, 0.05, 0.05, 0.05,
                              0.05, 0.05, 0.05, 0.05, 0.05);
        /* Exponential waiting time with a mean of one hour */
        time = time + '01:00:00't * rand("EXPONENTIAL");
        output;
    end;
run;

/* Define a time window within which causation is expected to occur. */
%let timeWindow=06:00:00; /* Six hours */

/* Determine which events occurred during the time window prior to our
   rare events (event = 1). The data is read in descending time order so
   that lastEventTime always holds the time of the next rare event. */
proc sort data=sequence;
    by descending time;
run;

data sequenceW;
    set sequence;
    retain lastEventTime 1e40;
    inWindow = time > lastEventTime - "&timeWindow."t;
    if event = 1 then lastEventTime = time;
    drop lastEventTime;
run;

/* Compare the frequencies of every event kind inside and outside the
   causation time window. An event causing our rare event should stand out
   as being too frequent inside the time window. Look at individual cell
   Chi-Squares and at the overall Likelihood Ratio Chi-Square Test */
proc freq data=sequenceW;
    table event*inWindow / cellchi2;
    exact lrchi / mc seed=86556;
run;

/* The simulation above represented the null hypothesis where no event was
   causing the rare event. */

/* Now, choose event = 2 as a causing event. This is simulated by adding
   an extra event = 2, two hours before each rare event. Note that the
   original event = 2 "noise" is left in the sequence. */
data sequencePlus;
    set sequence sequence(where=(event=1) in=special);
    if special then do;
        time = time - '02:00:00't;
        event = 2;
    end;
run;

/* Repeat the analysis */
proc sort data=sequencePlus;
    by descending time;
run;

%let timeWindow=06:00:00;

data sequencePlusW;
    set sequencePlus;
    retain lastEventTime 1e40;
    inWindow = time > lastEventTime - "&timeWindow."t;
    if event = 1 then lastEventTime = time;
    drop lastEventTime;
run;

proc freq data=sequencePlusW;
    table event*inWindow / cellchi2;
    exact lrchi / mc seed=86556;
run;

/* Now the presence of a causing event is shown by the overall Likelihood
   Ratio Chi-Square Test and the identity of the causing event (event = 2)
   is made clear by individual cell Chi-Squares */
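For readers who want to see what the PROC FREQ step is computing, the two statistics can be written out as follows. This is only a sketch in standard contingency-table notation; the symbols n_ij and e_ij are introduced here for illustration and do not appear in the code above.

```latex
% n_{ij}: observed count of event kind i with inWindow = j (j = 0 or 1)
% n_{i\cdot}, n_{\cdot j}: row and column totals; n: total number of events
\begin{align*}
e_{ij} &= \frac{n_{i\cdot}\, n_{\cdot j}}{n}
  && \text{expected count under independence} \\
\chi^2_{ij} &= \frac{(n_{ij} - e_{ij})^2}{e_{ij}}
  && \text{cell contribution shown by CELLCHI2} \\
G^2 &= 2 \sum_{i} \sum_{j} n_{ij} \ln \frac{n_{ij}}{e_{ij}}
  && \text{likelihood ratio chi-square (LRCHI)}
\end{align*}
```

A causing event should show up as a large cell contribution in the inWindow = 1 column, with the observed count above the expected one. Because only a handful of rare events occur, the expected counts in that column are small, which is presumably why the EXACT statement asks for a Monte Carlo estimate of the exact p-value instead of relying on the asymptotic one.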
09-28-2015 12:44 PM
I would be very careful about something like this, with such small N for the number of events. Otherwise, you end up saying things like: Pope Francis visiting the United States causes a total lunar eclipse. Big data and rare events can point out a concurrency of events that has absolutely nothing to do with causality.
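To put a rough number on that risk, here is a back-of-the-envelope calculation. The values m = 100 candidate event kinds and a significance level of 0.05 are assumptions chosen purely for illustration, and the tests are assumed to be independent; none of this comes from actual data.

```latex
% m: number of unrelated event kinds tested (assumed), \alpha: test level (assumed)
\begin{align*}
\mathrm{E}[\text{false positives}] &= m\,\alpha = 100 \times 0.05 = 5 \\
P(\text{at least one false positive}) &= 1 - (1 - \alpha)^{m}
  = 1 - 0.95^{100} \approx 0.994
\end{align*}
```

Under those assumptions, a few apparent "precursors" are expected even when nothing is related to the rare event at all.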
Currently, I don't believe any of the SAS procs really address statistical causality, even though a lot of us statistician types treat the results as if causality was demonstrated. The closest I would come to a causality argument would be to use Bayesian methods, such as in PROC MCMC, with a very strict and informative prior.
09-28-2015 02:47 PM
Very true, Steve! I think the demonstration of causality actually requires the identification of a plausible mechanism. Otherwise, it's only **bleep** statistics. <-- The **bleep** was inserted by the text editor; talk about freedom of speech!