SAS Data Science

Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Viya (Machine Learning), SAS Visual Text Analytics, with point-and-click interfaces or programming
GuyTreepwood

Hello,

I am currently working on putting together a dataset for a classification model (with a standard binary outcome), and I have a general question regarding independence of observations.

The data I am working with is aggregated to the sales_id level, and it is joined to contract data from a different data source. A single contract (contract_id) could be found in multiple sales_ids. The stakeholder would like me to create a column to indicate whether the contract_id is found in other sales_ids. Another column is to check whether the contract was executed within 10 days of a previous contract record in another sales_id observation. The goal is to generate predictions at the sales_id level. 

sales_id  contract_id  contract_date  contract_amount  contract_in_other_order  dup_contract_within_10days
651423    2456         12/1/2021      500              1                        0
561486    2456         12/5/2021      500              1                        1
618234    2456         12/31/2021     500              1                        0
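
For illustration, the two requested columns could be derived roughly as follows. This is a minimal Python sketch rather than SAS code, and it rests on assumptions that are not stated in the post: each (sales_id, contract_id) pair appears once, "within 10 days" means a gap of at most 10 days from the immediately preceding record of the same contract, and the helper name add_contract_flags is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Rows taken from the example table in the post.
rows = [
    {"sales_id": 651423, "contract_id": 2456, "contract_date": "12/1/2021"},
    {"sales_id": 561486, "contract_id": 2456, "contract_date": "12/5/2021"},
    {"sales_id": 618234, "contract_id": 2456, "contract_date": "12/31/2021"},
]


def parse_date(text):
    """Parse the month/day/year format used in the example table."""
    return datetime.strptime(text, "%m/%d/%Y")


def add_contract_flags(rows, window_days=10):
    """Return copies of the rows with the two engineered flags added.

    Assumes one row per (sales_id, contract_id) pair, so the previous
    record of a contract always belongs to a different sales_id.
    """
    by_contract = defaultdict(list)
    for row in rows:
        by_contract[row["contract_id"]].append(row)

    flagged = []
    for group in by_contract.values():
        # Process each contract's records in date order.
        group = sorted(group, key=lambda r: parse_date(r["contract_date"]))
        sales_ids = {r["sales_id"] for r in group}
        prev_date = None
        for r in group:
            current = parse_date(r["contract_date"])
            in_other = int(len(sales_ids - {r["sales_id"]}) > 0)
            dup = int(
                prev_date is not None
                and (current - prev_date).days <= window_days
            )
            flagged.append(
                {
                    **r,
                    "contract_in_other_order": in_other,
                    "dup_contract_within_10days": dup,
                }
            )
            prev_date = current
    return flagged


flags = {
    r["sales_id"]: (r["contract_in_other_order"], r["dup_contract_within_10days"])
    for r in add_contract_flags(rows)
}
```

Under these assumptions the sketch reproduces the flag values shown in the table. Both flags are computed by looking at other rows, which is exactly the cross-observation dependence the question asks about.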

Would engineering additional columns that check other observations violate the assumption of independence of observations? If so, is this an issue for a classification model (as it is for linear regression models)? If this would cause problems, what remedies are available?

Accepted Solutions
sbxkoenk

Hello,

To check whether you violate the independent observations assumption, you can plot the residuals against each variable used in the model (e.g., factors, regressors). A non-random pattern suggests a lack of independence.
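
As a rough numeric companion to a residual plot, the raw residuals (observed outcome minus predicted probability) can be averaged within each contract_id. The sketch below is Python rather than SAS and is only an illustration: the outcome and probability values are made up, contract 9001 is hypothetical, and mean_residual_by_group is not an existing library function.

```python
from collections import defaultdict


def mean_residual_by_group(groups, y_true, p_hat):
    """Average the raw residual (y - p) within each group key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, y, p in zip(groups, y_true, p_hat):
        sums[g] += y - p
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


# Hypothetical values for illustration only.
contract_ids = [2456, 2456, 2456, 9001, 9001]
y_true = [1, 1, 1, 0, 1]
p_hat = [0.5, 0.5, 0.5, 0.5, 0.5]

by_contract = mean_residual_by_group(contract_ids, y_true, p_hat)
```

Group means that sit far from zero for many contracts would hint that rows sharing a contract_id err in the same direction, which is one form of the non-random pattern mentioned above.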

Also, if you have an abundance of observations, split the data into a training, a validation and a test set.
If your model holds up on independent out-of-sample observations (never seen by the model), then I think you are OK.
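
One caveat, added here as an assumption rather than as part of the reply: because the same contract_id can appear under several sales_ids, a purely random row-level split could place related rows in both the training and the test set. A possible way to keep the hold-out data independent is to assign whole contracts to a partition. The Python sketch below shows the idea; the function name split_by_group, the 60/20/20 fractions, and all contract IDs other than 2456 are hypothetical.

```python
import random


def split_by_group(group_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign every row to train/validate/test so that all rows sharing
    a group id (here: contract_id) land in the same partition."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)  # seeded for reproducibility

    n = len(unique)
    cut1 = round(n * fractions[0])
    cut2 = cut1 + round(n * fractions[1])

    part_of = {}
    for i, g in enumerate(unique):
        if i < cut1:
            part_of[g] = "train"
        elif i < cut2:
            part_of[g] = "validate"
        else:
            part_of[g] = "test"
    return [part_of[g] for g in group_ids]


# Hypothetical contract ids, one entry per sales_id row.
contract_ids = [2456, 2456, 2456, 1001, 1002, 1003, 1003, 1004, 1005, 1005]
parts = split_by_group(contract_ids)
```

With this kind of split, the test rows come from contracts the model has never seen, so out-of-sample performance is less likely to be inflated by repeated contracts.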

Cheers,

Koen
