GuyTreepwood
Obsidian | Level 7

Hello,

 

I am currently working on putting together a dataset for a classification model (with a standard binary outcome), and I have a general question regarding independence of observations.

 

The data I am working with is aggregated to the sales_id level, and it is joined to contract data from a different data source. A single contract (contract_id) could be found in multiple sales_ids. The stakeholder would like me to create a column to indicate whether the contract_id is found in other sales_ids. Another column is to check whether the contract was executed within 10 days of a previous contract record in another sales_id observation. The goal is to generate predictions at the sales_id level. 

 

sales_id | contract_id | contract_date | contract_amount | contract_in_other_order | dup_contract_within_10days
651423   | 2456        | 12/1/2021     | 500             | 1                       | 0
561486   | 2456        | 12/5/2021     | 500             | 1                       | 1
618234   | 2456        | 12/31/2021    | 500             | 1                       | 0
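
For reference, here is a minimal sketch of how the two flag columns could be derived. It is written in plain Python rather than SAS, and the function name `add_contract_flags` is hypothetical. It assumes one row per sales_id (as in the sample), so the preceding row for the same contract_id always belongs to a different sales_id, and it treats "within 10 days" as a gap of 10 days or fewer.

```python
from collections import defaultdict
from datetime import datetime

# The three sample rows from the question (dates in M/D/YYYY format).
rows = [
    {"sales_id": 651423, "contract_id": 2456, "contract_date": "12/1/2021", "contract_amount": 500},
    {"sales_id": 561486, "contract_id": 2456, "contract_date": "12/5/2021", "contract_amount": 500},
    {"sales_id": 618234, "contract_id": 2456, "contract_date": "12/31/2021", "contract_amount": 500},
]

def parse_date(text):
    return datetime.strptime(text, "%m/%d/%Y")

def add_contract_flags(rows, window_days=10):
    """Return copies of the rows with the two engineered flags added.

    Assumption: each row is a distinct sales_id, so the previous row of the
    same contract_id is always a record from another sales_id.
    """
    by_contract = defaultdict(list)
    for row in rows:
        by_contract[row["contract_id"]].append(row)

    flagged_rows = []
    for group in by_contract.values():
        group = sorted(group, key=lambda r: parse_date(r["contract_date"]))
        shared = int(len({r["sales_id"] for r in group}) > 1)
        previous_date = None
        for row in group:
            current_date = parse_date(row["contract_date"])
            # After sorting, the nearest earlier record is the one just before.
            close = (
                previous_date is not None
                and (current_date - previous_date).days <= window_days
            )
            flagged = dict(row)
            flagged["contract_in_other_order"] = shared
            flagged["dup_contract_within_10days"] = int(close)
            flagged_rows.append(flagged)
            previous_date = current_date
    return flagged_rows

flags = {
    r["sales_id"]: (r["contract_in_other_order"], r["dup_contract_within_10days"])
    for r in add_contract_flags(rows)
}
```

On the sample rows this gives the same flag values as the table above.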

 

Would engineering additional columns that check other observations violate the assumption of independence of observations? If so, is this an issue for a classification model (as it is for linear regression models)? If it would cause problems, what remedies are available?

1 ACCEPTED SOLUTION

sbxkoenk
SAS Super FREQ

Hello,

 

To see if you violate the independent observations assumption, you can plot residuals against any variables used in the technique (e.g., factors, regressors). A pattern that is not random suggests lack of independence.
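
As a rough text stand-in for such a plot, the sketch below averages raw residuals (observed outcome minus predicted probability) within each level of a variable. It is plain Python, not SAS. The function name `mean_residual_by_level` and the numbers are made up purely for illustration, and the raw residual is only one of several residual definitions used with binary outcomes.

```python
from collections import defaultdict

def mean_residual_by_level(y_true, p_hat, levels):
    """Average raw residual (y - p) for each distinct value in `levels`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, p, level in zip(y_true, p_hat, levels):
        sums[level] += y - p
        counts[level] += 1
    return {level: sums[level] / counts[level] for level in sums}

# Hypothetical values: observed outcomes, predicted probabilities, and the
# contract_in_other_order flag for four observations.
y_true = [1, 0, 1, 0]
p_hat = [0.8, 0.3, 0.6, 0.1]
contract_in_other_order = [0, 0, 1, 1]

residual_means = mean_residual_by_level(y_true, p_hat, contract_in_other_order)
```

Means that sit clearly away from zero for some levels, or that trend with the variable, are the kind of non-random pattern to look for.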

 

Also, if you have an abundance of observations, do data splitting.
Make a training, a validation and a test set.
If your model holds up to independent out-of-sample observations (never seen by the model), then I think you are OK.
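
One way to keep the held-out rows truly unseen in a setting like this is to split by contract_id rather than by row, so that sales_ids sharing a contract never straddle two partitions. The sketch below is plain Python, not SAS. The function name `split_by_group`, the 60/20/20 fractions, and the toy rows are assumptions for illustration only.

```python
import random

def split_by_group(rows, group_key="contract_id", fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole groups to train/valid/test so no group is split up."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)  # reproducible shuffle of group ids

    n_train = int(len(groups) * fractions[0])
    n_valid = int(len(groups) * fractions[1])
    assignment = {}
    for position, group in enumerate(groups):
        if position < n_train:
            assignment[group] = "train"
        elif position < n_train + n_valid:
            assignment[group] = "valid"
        else:
            assignment[group] = "test"

    parts = {"train": [], "valid": [], "test": []}
    for row in rows:
        parts[assignment[row[group_key]]].append(row)
    return parts

# Hypothetical data: 10 contracts, each appearing under 2 sales_ids.
toy_rows = [
    {"sales_id": 100 * contract + k, "contract_id": contract}
    for contract in range(1, 11)
    for k in range(2)
]
parts = split_by_group(toy_rows)
```

Because assignment happens at the contract level, the partition sizes in rows only approximate the requested fractions when contracts differ in how many sales_ids they cover.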

 

Cheers,

Koen
