Suppose I sample 30 widgets from a manufacturing line over the course of a 24-hour day. The sampling times are not necessarily random or evenly distributed. Each widget is inspected and either passes or fails. So I have data like:
data have;
   input time :time5. fail;
   format time time5.;
   cards;
00:45 0
01:00 1
02:45 0
03:15 1
03:45 1
05:30 1
05:45 0
06:00 1
06:30 0
06:45 1
07:00 0
09:00 1
10:15 1
12:30 1
14:00 0
16:00 0
17:15 0
17:30 0
17:45 0
19:00 0
20:45 0
21:30 0
21:45 0
22:00 0
22:15 0
22:30 0
22:45 0
23:00 0
23:15 0
23:30 0
;
run ;
If I plot the data I get:
And looking at the plot, I see two clusters. There was a high failure rate between 0:00 and 12:30, and after 14:00 the failure rate was zero.
Of course usually the pattern isn't as clear. I could have data like:
At first it looks like there may be three clusters, with a low-failure-rate cluster between 10:00 and 16:00. But then when you think about it, there are only four data points in that period. If the failure rate was constant during the day (.3), it would not be unusual to have four consecutive successes (.7**4=.24). So maybe the correct interpretation is that there are no clusters in that data. Or maybe there are two clusters: 00:00-09:30 with a high failure rate and 10:00-24:00 with a lower failure rate.
I think what I'm looking for is a way to group the data into clusters (by time) such that each cluster's failure rate is significantly different from the failure rates of its neighboring clusters.
Or maybe another thing I might want is to identify any clusters (by time) where I can conclude that the failure rate is, say, <.1 (with 95% confidence).
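For that second goal, here's a rough sketch of the kind of check I have in mind: pick a candidate window (the 14:00 cutoff below is just something I eyeballed from the plot) and compute an exact one-sided upper confidence limit for the failure rate in that window.
/* Sketch: exact (Clopper-Pearson) one-sided 95% upper confidence limit
   for p(fail) within a candidate window (here 14:00-24:00, chosen by eye) */
proc sql;
   create table window as
   select count(*) as n, sum(fail) as x
   from have
   where time >= '14:00't;
quit;

data _null_;
   set window;
   upper = quantile('BETA', 0.95, x + 1, n - x);   /* not valid if every widget failed (x=n) */
   put n= x= upper=;
run;
For the 16 all-pass observations after 14:00 in my example data, that upper bound comes out around .17, so even a run that clean isn't enough to conclude the failure rate is below .1.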
Would appreciate any thoughts/suggestions. I realize this is not a well-framed question. I feel like I want some sort of cluster analysis? ChatGPT recommended looking into "temporal scan" methods which apparently create different time window sizes and then compare them. But I didn't find much on that approach that seemed helpful.
Quentin,
In statistical theory, cluster analysis is usually an unsupervised learning method (i.e., there is no response variable Y).
But in your case you have a "failure" outcome, so you need a supervised learning method (i.e., there is a response variable Y).
For a binary variable Y, the usual choice is PROC LOGISTIC (i.e., a logistic model). But since you want to account for a non-linear relationship between Y and X, you need to consider a SPLINE effect.
Check @Rick_SAS blogs:
https://blogs.sas.com/content/iml/2016/03/21/statistical-analysis-stephen-curry-shooting.html
https://blogs.sas.com/content/iml/2016/03/23/nonparametric-regression-binary-response-sas.html
And furthermore, if you are asking how many clusters you should choose, I have to confess that is an open, unsolved problem in statistical theory.
The following code is taken from Rick's blog:
proc logistic data=have;
   effect spl = spline(time / degree=2);
   model fail(event='1') = spl / SCALE=NONE AGGREGATE;
   effectplot fit(x=time) / obs;
run;
Thanks @Ksharp . I agree, it feels like it could be a logistic regression problem. But I'm hoping for a method that will decide the number of clusters, and have the p(failure) be the same within a cluster, but jump between clusters. So if I plotted the predicted probabilities, there would be a horizontal line for each cluster, and at each knot there would be a jump. One way I think about it is I want to do a logistic regression where the predictor is categorical (clusterID), but first I want a magical procedure to determine the clusterID for each record.
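Said another way: if some magical procedure handed me a hypothetical CLUSTERID variable, the last step would just be something like the sketch below (HAVE_CLUSTERED and CLUSTERID are imaginary; assigning them is the part I don't know how to do).
proc logistic data=have_clustered;
   class clusterID / param=glm;
   model fail(event='1') = clusterID;
   output out=pred predicted=p_fail;   /* one flat predicted p(fail) within each cluster */
run;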
I'm not thinking of time as a continuous predictor. I'm thinking (hoping) there is some unknown discrete event which occurs and causes a step change in the probability of failure.
I wonder if I should try using PROC NLIN. I know NLIN will search for knots. I would think you could constrain it so the slope of a spline must be 0. I'm not sure if NLIN is happy to allow a "jump" at each knot (i.e., the regression line is not connected). I've only used NLIN once, but it is so flexible it seems like it might be worth a try.
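As a very rough, untested sketch of what I mean (the smoothing constant, starting values, and single change point are all made up), maybe something like a smoothed step, so NLIN has a gradient to work with:
/* one unknown change point T0 (in hours): constant rate P1 before, P2 after;
   the logistic term approximates a jump because NLIN can't search a hard step */
data have_h;
   set have;
   hrs = time / 3600;   /* time of day in hours */
run;

proc nlin data=have_h;
   parms p1=0.6 p2=0.1 t0=12;
   bounds 0 <= p1 <= 1, 0 <= p2 <= 1, 0 <= t0 <= 24;
   model fail = p1 + (p2 - p1) / (1 + exp(-5*(hrs - t0)));
run;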
Quentin,
I am not a statistician either.
If you want to use PROC NLIN, you have to know the function y=f(x) to fit to your data, but you don't know it.
If you want to use Poisson regression, here is a reference:
http://support.sas.com/kb/24/188.html
But I think Poisson and logistic regression would give similar results. Poisson is for count data, and negative binomial is for more dispersed (overdispersed) count data.
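If you want a rough sketch of the Poisson idea (the hourly grouping is just one arbitrary choice; log(trials) goes in as an OFFSET so you model the failure rate):
proc sql;
   create table hourly as
   select hour(time)    as hr,
          sum(fail)     as failures,
          count(*)      as trials,
          log(count(*)) as log_trials
   from have
   group by calculated hr;
quit;

proc genmod data=hourly;
   model failures = hr / dist=poisson link=log offset=log_trials;
run;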
Or you could try posting this in the Forecasting forum, since you have time series data.
https://communities.sas.com/t5/SAS-Forecasting-and-Econometrics/bd-p/forecasting_econometrics
Another, better SPLINE effect for fitting a non-linear relationship is the NATURALCUBIC spline (also taken from Rick's blog):
proc logistic data=have;
   effect spl = spline(time / naturalcubic basis=tpf(noint));
   model fail(event='1') = spl / SCALE=NONE AGGREGATE;
   effectplot fit(x=time) / obs;
run;
I think you can plot a density estimate of the failures and successes and look for peaks (modes) of the density estimates. I'd use PROC UNIVARIATE and a CLASS statement for the failure flag. If both KDEs are approximately constant, then the quality does not exhibit "clusters" in the way that you describe.
I don't work with time-of-day data very often, but the following code might help you get started:
proc univariate data=have;
   class fail;
   histogram time / kernel(lower='0:00't upper='24:00't) nobars
                    endpoints=('0:00't to '24:00't by '1:00't);
   /* how to format the X axis??? */
run;
It isn't clear to me whether you should let the procedure choose the kernel bandwidth based on the data or figure out a bandwidth that you will always use. For example, compare the output of the above to the output from the KERNEL(C=0.5) option.
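For example, the fixed-bandwidth version of the call above would look something like this:
proc univariate data=have;
   class fail;
   histogram time / kernel(c=0.5 lower='0:00't upper='24:00't) nobars
                    endpoints=('0:00't to '24:00't by '1:00't);
run;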
Thanks @Rick_SAS. That's an interesting idea to look at the distribution of failures and successes independently. Since my data sometimes has more trials during certain periods of the day, maybe I would look at the difference between those two distributions.
Maybe I should back up in my thinking. Instead of asking "find me the clusters," really the first question is "are there clusters?" Meaning, if I look at the data for the day, could I reject the null hypothesis that p(fail) is constant throughout the day? A naive way to do it would be to pick arbitrary clusters (e.g., four 6-hour windows, or eight 3-hour windows) and run a chi-square test. Or some sort of iterative grid search to find the four clusters that have the maximum chi-square statistic (which leads me back to thinking about NLIN again.... : )
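Just to make that naive fixed-window version concrete, I'm imagining something like this (four 6-hour windows, chosen arbitrarily):
/* assign each observation to one of four 6-hour windows, then test
   whether p(fail) differs across windows */
data windows;
   set have;
   window = floor(time / '06:00't);   /* 0, 1, 2, 3 */
run;

proc freq data=windows;
   tables window*fail / chisq;
   exact fisher;                      /* cell counts are small, so an exact test */
run;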
My general fear is that it's easy to look at a plot of failures by time of day and *think* you see meaningful clumps of failures. But of course even if the data are random, there will be "runs" in the data that look clumpy. And people can't intuitively know from eyeballing a chart whether an apparent clump is good evidence of a change in p(fail) during that time window, or just a clump that would happen randomly. Another way I start to think about it is "is the process Poisson?" So if I had more data, I could cut the day into 24 one-hour windows and count the number of failures in each window. If those counts followed a Poisson distribution, I think that would be evidence that p(fail) is constant for the day. (I guess that's similar to my chi-square idea.) But if I had that much data, I'd probably just make a control chart (p-chart) with the data grouped by hour. That would be nice.
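And the p-chart idea, if I had enough data per hour, might look something like this SAS/QC sketch (the hourly grouping and the dataset/variable names are just placeholders):
/* summarize failures and trials by hour, then chart the hourly proportion */
data have_hr;
   set have;
   hr = hour(time);   /* 0-23 subgroups */
run;

proc summary data=have_hr nway;
   class hr;
   var fail;
   output out=byhour(rename=(_freq_=trials)) sum=nfail;
run;

proc shewhart data=byhour;
   pchart nfail*hr / subgroupn=trials;
run;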
As a non-statistician, here's the approach I took about 10 years ago. There was clustering software available for purchase, but it was quite expensive. So here is my workaround approach.
Sum up each 30 minutes: total N and # failures.
Start at the beginning of the data and ask: what would happen if we added the second 30-minute segment to the first? Would the success rate for the combination be significantly different from the first segment's? If so, the second segment begins a new cluster. If not, the first two segments combined become cluster #1. Keep incorporating segments until the next one is significantly different, and that segment starts the next cluster.
You get to fiddle with the segment size and the definition of "significantly different".
In the end, this approach produced very similar results to off-the-shelf software that cost hundreds of thousands of dollars. On the other hand, if you start at the end of the data and work backwards in time, the results would change dramatically. Also note, clusters are not defined "within the day". More like a "bucket" than a "cluster", a cluster could end up spanning more than a day or less than a day, and could include portions of two (or more) days. At any rate, perhaps this gives you some ideas you could apply.
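To make the mechanics concrete, here is a rough DATA step sketch of that merging rule (the 30-minute segment size, the simple two-proportion z-test, and the 0.05 cutoff are exactly the things you'd fiddle with, and the variable names are made up):
/* 1. roll the raw data up into 30-minute segments */
data segments;
   set have;
   seg = floor(time / 1800);   /* 1800 seconds = 30 minutes */
run;

proc summary data=segments nway;
   class seg;
   var fail;
   output out=segsum(rename=(_freq_=n)) sum=nfail;
run;

/* 2. walk through the segments in time order, merging each one into the
      current cluster unless its failure rate is "significantly different" */
data clusters;
   set segsum;
   retain cluster 1 cn 0 cfail 0;
   if cn > 0 then do;
      p1 = cfail / cn;   /* rate in the cluster so far */
      p2 = nfail / n;    /* rate in the next segment   */
      p  = (cfail + nfail) / (cn + n);
      se = sqrt(p*(1 - p)*(1/cn + 1/n));
      if se > 0 and abs((p2 - p1)/se) > probit(0.975) then do;
         cluster + 1;          /* significantly different: */
         cn = 0; cfail = 0;    /* start a new cluster here */
      end;
   end;
   cn    + n;
   cfail + nfail;
run;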
Thanks @Astounding, as a fellow non-statistician, I can imagine an approach like that. I just worry that I'd be doing multiple comparisons (obviously), and would want to correct for that somehow (a Bonferroni correction of some sort), and wouldn't trust myself to correct appropriately. So if there is a PROC or an obviously defensible way to do it, I'd be happy. But yes, if I can't do it that way, I may 'roll my own,' with some sort of iterative search like you suggest. It would be better than the eyeball test at least.
On most days I don't really have enough data to do iterative steps by hour, because I only have 30 data points for the day. But on some days when the failure rate is high (which are obviously the most important days to understand), they will over-sample. So on those days I might have enough data to test as you suggest. I might even have enough data to just make a control chart (p-chart) by hour, and that might be good enough to see if there are interesting within-day patterns.
@Quentin wrote: ChatGPT recommended looking into "temporal scan" methods which apparently create different time window sizes and then compare them. But I didn't find much on that approach that seemed helpful.
SAS has no spatial, temporal, or space-time scan statistics.
I entered a feature request for this with SAS R&D in 2023.
For the temporal scan I had an IML program (written around 2011), but I lost it due to a file retention policy.
Koen
Thanks @sbxkoenk. I saw a blog post last night (can't find it now) which suggested there might be some of this in Enterprise Miner, but I don't have that. I do have SAS/QC and SAS/ETS, so I was hoping there might be some PROC there that would be helpful. But when I asked ChatGPT, it also suggested writing a macro with IML code in it to do it by hand.
If you Google-search on […], you will find useful stuff.
Michael Leonard of SAS Forecasting R&D (now retired) also once wrote a paper on this, but I cannot find it. I believe there was an experimental HPF (High-Performance Forecasting) procedure for this (PROC HPFSEGMENT??), but that procedure never made it into the production software, I'm afraid.
Of course, if you have only "0" and "1" as possible time series values, then these pattern change detection methods might not be ideal. Temporal scan statistics can do the job, but SAS does not offer this (see another post of mine in this thread).
Koen
Thanks for the search terms! I've been googling "binomial cluster analysis time" and not finding that much. I haven't used ChatGPT much for programming but in this case it was surprisingly helpful in predicting what I might be talking about, and providing some interesting responses.