05-04-2013 09:43 AM
I am currently working on an assignment for college, where I need to create a predictive data mining model to determine sickness in patients.
I've been working with regression so far (I will also try decision trees and neural networks), and I just noticed something: when I added the data partition node, I left the sampling method at the default, simple random.
My dataset has several variables, such as age (interval) and sex (binary), among others. I was wondering whether I should set the data partition node to stratified and use one or more of those variables (e.g. sex, perhaps also age).
I realise this might be a case of "it is up to you", but I would really like some advice from the community. Why would I choose simple random over stratified, or the other way round, and how should I judge which method to use?
Any thoughts around this?
Thanks in advance for the help.
05-04-2013 01:07 PM
You generally stratify when you believe subgroups of the population differ significantly. For example, if you were looking at physical health characteristics, males and females could be very different, so you could stratify on that variable. One way to find variables to stratify on is to run your regression and see which variables are significant; those might be worth stratifying on.
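To make the difference concrete, here is a minimal Python sketch of the two sampling methods. It is not the data partition node itself, just an illustration of the idea. The patient records, the 70/30 split, and the function names (`simple_random_split`, `stratified_split`, `share`) are all made up for this example.

```python
import random
from collections import defaultdict


def simple_random_split(rows, train_frac=0.7, seed=0):
    """Shuffle all rows together and cut once (simple random sampling)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def stratified_split(rows, key, train_frac=0.7, seed=0):
    """Split each stratum (group sharing the same `key` value) separately,
    so every partition keeps roughly the same mix of that variable."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    train, valid = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


def share(rows, key, value):
    """Fraction of rows whose `key` equals `value`."""
    return sum(1 for r in rows if r[key] == value) / len(rows)


# Hypothetical, deliberately unbalanced data: 20 female and 80 male patients.
patients = [{"id": i, "sex": "F" if i < 20 else "M"} for i in range(100)]

train_s, valid_s = stratified_split(patients, "sex")
train_r, valid_r = simple_random_split(patients)

# Stratified: the share of females is the same in both partitions.
print("stratified:", share(train_s, "sex", "F"), share(valid_s, "sex", "F"))
# Simple random: the shares depend on the shuffle and may drift apart.
print("simple random:", share(train_r, "sex", "F"), share(valid_r, "sex", "F"))
```

With the stratified split, both partitions keep the 20% share of females by construction, whereas the simple random split only matches it on average. Note that this sketch needs a categorical variable; an interval variable such as age would have to be binned into groups before it could be used as a stratum.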
My 2 cents
05-04-2013 01:28 PM
Thank you very much for your feedback. These are good points which I am certainly taking on board.
In this assignment I am performing only the first cycle of data mining. But this is valuable information for my recommendations.