Problem statement

I need to model contact inactivation, defined as a contact having 12 consecutive months with no touchpoints. At any given scoring date, contacts in the base can have different amounts of accumulated inactivity (for example, 1 month, 5 months, 8 months, etc.), i.e. they are partway toward the 12-month churn threshold.

Objective

I want the model to score contacts at any time and estimate the probability they will reach 12 months of inactivity within the next 12 months (or, equivalently, to churn within the next 12 months).

Proposed approach and question

I’m considering creating a dataset of snapshots (one per contact per prediction date), with a continuous feature “months_inactive_so_far” (N) and other historical features computed up to that snapshot. The label would be whether the contact reaches 12 months of inactivity within the subsequent 12 months.

My question: Is this a reasonable way to prepare the training set, or are there better/principled alternatives (e.g., survival analysis or different labeling strategies)? Are there pitfalls I should watch for (censoring, leakage, splitting training/test by contact or time)? Any references or practical experience would be appreciated.

Additional details that may help answerers (add if relevant)

Are snapshots monthly, weekly, or event-driven?
Do you have full 12 months of follow-up for all snapshots, or are recent snapshots right-censored?
Do you need a single probability (will churn in next 12 months) or a time-to-event estimate (when will churn occur)?
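
To make the proposed snapshot labeling concrete, here is a minimal sketch in plain Python. Everything in it other than the 12-month threshold, the 12-month horizon, and the feature name months_inactive_so_far is an assumption made for illustration: the helper names (add_months, completed_months, churn_date_from, build_snapshots), the toy touchpoint data, the snapshot dates, and the data_end cutoff are all hypothetical. It also assumes that snapshots whose 12-month outcome window extends past the end of the data are dropped rather than labeled, and that contacts already past the threshold at the snapshot are excluded.

```python
import calendar
from datetime import date

THRESHOLD_MONTHS = 12  # inactivity that defines churn (from the post)
HORIZON_MONTHS = 12    # prediction window (from the post)


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's last day."""
    year, month0 = divmod(d.year * 12 + d.month - 1 + months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def completed_months(start, end):
    """Whole months elapsed from start to end (approximate at month ends)."""
    months = (end.year - start.year) * 12 + end.month - start.month
    return months - 1 if end.day < start.day else months


def churn_date_from(touches, last_idx):
    """Date on which the first 12-month gap at or after touches[last_idx] completes."""
    for i in range(last_idx, len(touches)):
        deadline = add_months(touches[i], THRESHOLD_MONTHS)
        is_last = i + 1 == len(touches)
        if is_last or touches[i + 1] > deadline:
            return deadline


def build_snapshots(touches_by_contact, snapshot_dates, data_end):
    """One row per (contact, snapshot date) with a fully observed outcome window."""
    rows = []
    for contact_id, raw in touches_by_contact.items():
        touches = sorted(raw)
        for snap in snapshot_dates:
            horizon_end = add_months(snap, HORIZON_MONTHS)
            if horizon_end > data_end:
                continue  # right-censored: outcome window not fully observed
            past = [i for i, t in enumerate(touches) if t <= snap]
            if not past:
                continue  # contact has no touchpoint yet at this snapshot
            last_idx = past[-1]
            last_touch = touches[last_idx]
            if add_months(last_touch, THRESHOLD_MONTHS) <= snap:
                continue  # already at 12 months of inactivity, nothing to predict
            rows.append({
                "contact_id": contact_id,
                "snapshot_date": snap,
                # Feature uses only touchpoints at or before the snapshot.
                "months_inactive_so_far": completed_months(last_touch, snap),
                # Label uses only what happens after the snapshot.
                "label": int(churn_date_from(touches, last_idx) <= horizon_end),
            })
    return rows


# Toy data (hypothetical): contact A goes silent, contact B stays active.
touches_by_contact = {
    "A": [date(2020, 1, 15), date(2020, 3, 10)],
    "B": [date(2020, 1, 20), date(2020, 8, 5), date(2021, 5, 1), date(2022, 2, 1)],
}
snapshot_dates = [date(2020, 6, 1), date(2020, 12, 1), date(2021, 6, 1), date(2021, 12, 1)]
snapshots = build_snapshots(touches_by_contact, snapshot_dates, data_end=date(2022, 6, 30))
for row in snapshots:
    print(row)
```

In this sketch the same contact contributes several rows, so rows are not independent. That is why the splitting question above matters: one option would be to hold out the latest snapshot dates for testing, possibly combined with keeping each contact on one side of the split.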