Solved: Independence of observations for classification models

GuyTreepwood · Posted 01-28-2022 12:38 PM

Hello,

I am currently working on putting together a dataset for a classification model (with a standard binary outcome), and I have a general question regarding independence of observations.

The data I am working with is aggregated to the sales_id level, and it is joined to contract data from a different data source. A single contract (contract_id) could be found in multiple sales_ids. The stakeholder would like me to create a column to indicate whether the contract_id is found in other sales_ids. Another column is to check whether the contract was executed within 10 days of a previous contract record in another sales_id observation. The goal is to generate predictions at the sales_id level.

sales_id	contract_id	contract_date	contract_amount	contract_in_other_order	dup_contract_within_10days
651423	2456	12/1/2021	500	1	0
561486	2456	12/5/2021	500	1	1
618234	2456	12/31/2021	500	1	0
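
For concreteness, here is a minimal Python sketch of how the two flag columns above could be derived. It is an illustration, not the stakeholder's actual logic. It assumes one row per sales_id, that "within 10 days" is inclusive, and that the comparison is against the immediately preceding record for the same contract_id. The function name `add_contract_flags` and the `window_days` parameter are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows copied from the table above (one row per sales_id is assumed).
rows = [
    {"sales_id": 651423, "contract_id": 2456, "contract_date": "12/1/2021", "contract_amount": 500},
    {"sales_id": 561486, "contract_id": 2456, "contract_date": "12/5/2021", "contract_amount": 500},
    {"sales_id": 618234, "contract_id": 2456, "contract_date": "12/31/2021", "contract_amount": 500},
]


def parse_date(text):
    """Parse a month/day/year string such as '12/1/2021'."""
    return datetime.strptime(text, "%m/%d/%Y")


def add_contract_flags(rows, window_days=10):
    """Return copies of rows with the two engineered flag columns added.

    Hypothetical helper: the inclusive window and the comparison against the
    immediately preceding record are assumptions, not confirmed requirements.
    """
    by_contract = defaultdict(list)
    for row in rows:
        by_contract[row["contract_id"]].append(row)

    flagged = []
    for group in by_contract.values():
        # Order each contract's records by date so "previous" is well defined.
        group = sorted(group, key=lambda r: parse_date(r["contract_date"]))
        in_other_order = int(len({r["sales_id"] for r in group}) > 1)
        prev_date = None
        for r in group:
            date = parse_date(r["contract_date"])
            within = prev_date is not None and (date - prev_date).days <= window_days
            flagged.append({
                **r,
                "contract_in_other_order": in_other_order,
                "dup_contract_within_10days": int(within),
            })
            prev_date = date
    return flagged


flags = add_contract_flags(rows)
for r in flags:
    print(r["sales_id"], r["contract_in_other_order"], r["dup_contract_within_10days"])
```

Note that each flag is computed from other rows that share the same contract_id, which is exactly why the independence question below arises.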

Would engineering additional columns that check other observations violate the assumption of independence of observations? If so, is this an issue for a classification model (as it is for linear regression models)? If this would cause problems, what remedies are available?

sbxkoenk · Posted 01-28-2022 03:26 PM

Hello,

To see if you violate the independent observations assumption, you can plot residuals against any variables used in the technique (e.g., factors, regressors). A pattern that is not random suggests lack of independence.

Also, if you have an abundance of observations, split the data into a training, a validation and a test set.
If your model holds up on independent out-of-sample observations (never seen by the model), then I think you are OK.
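
One caveat for data like that in the question: if rows sharing a contract_id land on both sides of a random split, the held-out rows are not fully independent of the training rows. A possible remedy, sketched here under assumptions, is to split by contract_id so that each contract falls entirely in one partition. The function `group_split`, its parameters, and the sample rows are hypothetical, and only a train/test split is shown; a validation set could be carved out the same way.

```python
import random

# Hypothetical sample: several sales_ids, some sharing a contract_id.
sample_rows = [
    {"sales_id": 1, "contract_id": "A"},
    {"sales_id": 2, "contract_id": "A"},
    {"sales_id": 3, "contract_id": "B"},
    {"sales_id": 4, "contract_id": "C"},
    {"sales_id": 5, "contract_id": "C"},
    {"sales_id": 6, "contract_id": "D"},
]


def group_split(rows, group_key="contract_id", test_fraction=0.25, seed=0):
    """Split rows so that every group (contract) ends up wholly in train or test."""
    groups = sorted({r[group_key] for r in rows})
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(groups)
    # Hold out at least one group; the fraction applies to groups, not rows.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


train_rows, test_rows = group_split(sample_rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

If scikit-learn is available, its `GroupShuffleSplit` and `GroupKFold` splitters serve the same purpose.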

Cheers,

Koen
