Hello,
I am currently working on putting together a dataset for a classification model (with a standard binary outcome), and I have a general question regarding independence of observations.
The data I am working with is aggregated to the sales_id level, and it is joined to contract data from a different data source. A single contract (contract_id) could be found in multiple sales_ids. The stakeholder would like me to create a column to indicate whether the contract_id is found in other sales_ids. Another column is to check whether the contract was executed within 10 days of a previous contract record in another sales_id observation. The goal is to generate predictions at the sales_id level.
sales_id | contract_id | contract_date | contract_amount | contract_in_other_order | dup_contract_within_10days |
651423 | 2456 | 12/1/2021 | 500 | 1 | 0 |
561486 | 2456 | 12/5/2021 | 500 | 1 | 1 |
618234 | 2456 | 12/31/2021 | 500 | 1 | 0 |
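For reference, the two flag columns described above could be derived roughly as follows. This is only a sketch in plain Python, not the stakeholder's definition: the function name `add_contract_flags` and the `window_days` parameter are hypothetical, and it assumes "previous" means a strictly earlier `contract_date` on a different `sales_id`.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# The three example rows from the table above.
rows = [
    {"sales_id": 651423, "contract_id": 2456, "contract_date": "12/1/2021", "contract_amount": 500},
    {"sales_id": 561486, "contract_id": 2456, "contract_date": "12/5/2021", "contract_amount": 500},
    {"sales_id": 618234, "contract_id": 2456, "contract_date": "12/31/2021", "contract_amount": 500},
]

def parse_date(text):
    # Dates in the example are month/day/year.
    return datetime.strptime(text, "%m/%d/%Y")

def add_contract_flags(rows, window_days=10):
    """Return copies of rows with the two flag columns added (hypothetical helper)."""
    # Index every (date, sales_id) pair by contract_id.
    by_contract = defaultdict(list)
    for row in rows:
        by_contract[row["contract_id"]].append((parse_date(row["contract_date"]), row["sales_id"]))

    window = timedelta(days=window_days)
    flagged = []
    for row in rows:
        date = parse_date(row["contract_date"])
        # Records of the same contract that belong to a different sales_id.
        others = [(d, s) for d, s in by_contract[row["contract_id"]] if s != row["sales_id"]]
        in_other = int(bool(others))
        # Assumption: "previous" means strictly earlier, at most window_days before.
        dup = int(any(timedelta(0) < date - d <= window for d, _ in others))
        flagged.append({**row, "contract_in_other_order": in_other, "dup_contract_within_10days": dup})
    return flagged

flagged = add_contract_flags(rows)
for row in flagged:
    print(row["sales_id"], row["contract_in_other_order"], row["dup_contract_within_10days"])
```

Note that each flag is computed by looking at other rows, which is exactly the cross-observation dependence the question asks about.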
Would engineering additional columns that check other observations violate the assumption of independence of observations? If so, is this an issue for a classification model (as it is for linear regression models)? If this would cause problems, what remedies are available?
Accepted Solutions
Hello,
To see if you violate the independent observations assumption, you can plot residuals against any variables used in the technique (e.g., factors, regressors). A pattern that is not random suggests lack of independence.
Also, if you have plenty of observations, split the data into a training set, a validation set, and a test set.
If your model holds up on independent out-of-sample observations (ones the model has never seen), then I think you are OK.
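As a rough illustration of the splitting step, here is a minimal sketch in plain Python. It goes one step beyond what is stated above, so treat it as an assumption: because several sales_id rows can share one contract_id, it assigns whole contracts to a single partition so that the held-out rows do not share a contract with the training rows. The function name `split_by_group`, the 60/20/20 fractions, and the toy data are all hypothetical.

```python
import random

def split_by_group(rows, group_key="contract_id", fractions=(0.6, 0.2, 0.2), seed=0):
    """Split rows into train/valid/test so that no group spans two partitions."""
    # Shuffle the distinct group ids reproducibly.
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)

    # Partition sizes are counted in groups, not rows; the remainder goes to test.
    n_train = round(len(groups) * fractions[0])
    n_valid = round(len(groups) * fractions[1])
    train_groups = set(groups[:n_train])
    valid_groups = set(groups[n_train:n_train + n_valid])

    splits = {"train": [], "valid": [], "test": []}
    for row in rows:
        if row[group_key] in train_groups:
            splits["train"].append(row)
        elif row[group_key] in valid_groups:
            splits["valid"].append(row)
        else:
            splits["test"].append(row)
    return splits

# Hypothetical toy data: 10 contracts, each appearing in 3 sales_ids.
toy_rows = [{"sales_id": 1000 + 3 * c + k, "contract_id": c} for c in range(10) for k in range(3)]
splits = split_by_group(toy_rows)
print({name: len(part) for name, part in splits.items()})
```

A plain random split over rows would also give training, validation, and test sets, but rows from the same contract could then land on both sides, and the test set would no longer be fully unseen.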
Cheers,
Koen