I have a large dataset with a binary target variable (0s and 1s). I'm looking to randomly split the data into a training, validation, and test set while maintaining the ratio of 0s and 1s across all datasets.
How would I do this, or what procedures should I be looking into? I tried PROC PARTITION, but I don't have a CAS engine library set up (I don't know how to check whether one has been set up or how to set up a session myself).
Accepted Solutions
Many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT...) and the ADAPTIVEREG procedure.
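As a rough illustration of the PARTITION statement, here is a minimal sketch using PROC GLMSELECT. The data set name `Have` and the variables `Y` and `X1-X5` are hypothetical placeholders, and the fractions are arbitrary. Note that GLMSELECT fits linear models; for a binary target you would more likely use one of the other procedures listed above, whose PARTITION syntax is similar but should be checked against its own documentation.

```sas
/* Sketch only: Have, Y, and X1-X5 are hypothetical names. */
/* The procedure randomly assigns about 20% of the rows to */
/* validation and 20% to test; the rest are used to train. */
proc glmselect data=Have seed=12345;
   partition fraction(validate=0.2 test=0.2);
   /* choose the model that performs best on the validation rows */
   model Y = X1-X5 / selection=stepwise(choose=validate);
run;
```

With this approach the split happens inside the procedure, so no separate training, validation, and test data sets need to be created beforehand.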
If you want to create the data yourself, you can use the DATA step to split the data randomly (which approximately preserves the proportion of 0s and 1s), or you can use the GROUPS= option in the SURVEYSELECT procedure to specify the exact number of observations in each group.
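The two approaches just described can be sketched as follows. This is an illustration under assumptions, not the worked example from the linked article: the data set `Have`, the binary variable `Target`, the 60/20/20 proportions, and the seeds are all hypothetical. The second part combines GROUPS= with a STRATA statement so that groups are assigned separately within each level of `Target`; this assumes your SAS/STAT release supports the GROUPS= option.

```sas
/* Approach 1: simple random split in a DATA step.             */
/* The 0/1 ratio is preserved only approximately.              */
data TrainR ValidateR TestR;
   set Have;                        /* Have is a hypothetical input */
   if _n_ = 1 then call streaminit(12345);
   /* returns 1 or 2 with the given probabilities, 3 otherwise */
   _k = rand("Table", 0.6, 0.2);
   if      _k = 1 then output TrainR;
   else if _k = 2 then output ValidateR;
   else                output TestR;
   drop _k;
run;

/* Approach 2: stratified split with PROC SURVEYSELECT.        */
/* STRATA requires the data to be sorted by the strata variable */
proc sort data=Have out=HaveSorted;
   by Target;                       /* Target is the assumed 0/1 variable */
run;

/* Within each level of Target, assign rows at random to 10    */
/* groups of (nearly) equal size; the group is in GroupID.     */
proc surveyselect data=HaveSorted out=Grouped groups=10 seed=12345;
   strata Target;
run;

/* Map the 10 groups to a 60/20/20 split */
data Train Validate Test;
   set Grouped;
   if      GroupID <= 6 then output Train;
   else if GroupID <= 8 then output Validate;
   else                      output Test;
   drop GroupID;
run;

/* Check that the proportion of 0s and 1s is similar in each set */
proc freq data=Train;    tables Target; run;
proc freq data=Validate; tables Target; run;
proc freq data=Test;     tables Target; run;
```

Using 10 equal groups is just one convenient way to get unequal split proportions; any number of groups that divides the desired proportions evenly would serve the same purpose.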
Additional discussion and completely worked examples are available at "Create training, validation, and test data sets in SAS."
I have a large dataset with a binary target variable (0s and 1s). I'm looking to randomly split the data into a training, validation, and test set while maintaining the ratio of 0s and 1s across all datasets.
I am not aware of this requirement for most modeling. Normally, the data is split at random, so the ratio of 0s and 1s in each data set is also random. Why is it needed?
How would I do this or what procedures should I be looking into? I tried proc partition, but I don't have a CAS engine library setup
What parts of SAS do you have?
Paige Miller