Hello,
I am currently working on putting together a dataset for a classification model (with a standard binary outcome), and I have a general question regarding independence of observations.
The data I am working with is aggregated to the sales_id level, and it is joined to contract data from a different data source. A single contract (contract_id) could be found in multiple sales_ids. The stakeholder would like me to create a column to indicate whether the contract_id is found in other sales_ids. Another column is to check whether the contract was executed within 10 days of a previous contract record in another sales_id observation. The goal is to generate predictions at the sales_id level.
sales_id | contract_id | contract_date | contract_amount | contract_in_other_order | dup_contract_within_10days |
651423 | 2456 | 12/1/2021 | 500 | 1 | 0 |
561486 | 2456 | 12/5/2021 | 500 | 1 | 1 |
618234 | 2456 | 12/31/2021 | 500 | 1 | 0 |
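For illustration, the two flag columns in the table above could be derived along these lines. This is only a sketch in pandas, assuming one row per sales_id and that "previous contract record" means the most recent earlier row with the same contract_id; the DataFrame name `df` is hypothetical.

```python
import pandas as pd

# The three example rows from the table above (flags not yet computed).
df = pd.DataFrame({
    "sales_id": [651423, 561486, 618234],
    "contract_id": [2456, 2456, 2456],
    "contract_date": pd.to_datetime(
        ["12/1/2021", "12/5/2021", "12/31/2021"], format="%m/%d/%Y"
    ),
    "contract_amount": [500, 500, 500],
})

# Order rows so that "previous record" is well defined within each contract.
df = df.sort_values(["contract_id", "contract_date"]).reset_index(drop=True)

# 1 if the same contract_id appears under more than one sales_id.
n_sales = df.groupby("contract_id")["sales_id"].transform("nunique")
df["contract_in_other_order"] = (n_sales > 1).astype(int)

# Gap to the previous record of the same contract (NaT for the first record).
gap = df.groupby("contract_id")["contract_date"].diff()

# 1 if that gap is at most 10 days; the NaT comparison evaluates to False, so
# the first record of each contract gets 0.
df["dup_contract_within_10days"] = (gap <= pd.Timedelta(days=10)).astype(int)

print(df)
```

Note that both flags depend on which other rows are present in the data, so they would need to be recomputed whenever rows are added or removed.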
Would engineering additional columns that check other observations violate the assumption of independence of observations? If so, is this an issue for a classification model (as it is for linear regression models)? If it would cause problems, what remedies are available?
Hello,
To check whether you violate the independent observations assumption, you can plot the residuals against each variable used in the model (e.g., factors, regressors). A non-random pattern suggests a lack of independence.
Also, if you have an abundance of observations, split the data into a training, a validation and a test set.
If your model holds up on independent out-of-sample observations (never seen by the model), then I think you are OK.
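As a rough sketch of such a split: because the same contract_id can appear under several sales_ids, one possible way to keep the hold-out rows independent of the training rows is to assign whole contracts, rather than individual rows, to each set. The toy data, the 60/20/20 proportions and the column name `split` below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 contracts, several of them shared across sales_ids.
df = pd.DataFrame({
    "sales_id": range(1000, 1020),
    "contract_id": [1, 1, 1, 2, 2, 3, 4, 4, 5, 6,
                    6, 6, 7, 8, 8, 9, 10, 10, 10, 10],
})

# Shuffle the distinct contract_ids reproducibly.
rng = np.random.default_rng(seed=0)
contracts = rng.permutation(df["contract_id"].unique())

# Cut the shuffled contracts into roughly 60% / 20% / 20%.
n = len(contracts)
train_ids, valid_ids, test_ids = np.split(
    contracts, [int(0.6 * n), int(0.8 * n)]
)

# Map every contract to exactly one set, then label each row accordingly.
split_of = {}
for name, ids in [("train", train_ids), ("valid", valid_ids), ("test", test_ids)]:
    for cid in ids:
        split_of[int(cid)] = name
df["split"] = df["contract_id"].map(split_of)

print(df.groupby("split")["contract_id"].nunique())
```

With this kind of split, every row of a given contract lands in the same set, so the test rows share no contract with the rows used for fitting.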
Cheers,
Koen