How do I split my dataset into 70% training and 30% testing?
Dear all,
I have a dataset in CSV format. I am looking for a way or tool to randomly assign 70% of the records to training and 30% to testing, so that both subsets are random samples from the same distribution. I chose 70%-30% because it seems to be a common rule of thumb.
Any suggestions, methods, or guides? Can this be done in EG or EM?
Thank you.
Regards,
YL
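After importing your CSV file, try the code below:
/* Attach a random number to every row (sashelp.heart stands in for your data) */
data temp;
    set sashelp.heart;
    n = ranuni(8);
run;
/* Put the rows in random order */
proc sort data=temp;
    by n;
run;
/* The first 70% of the shuffled rows go to training, the rest to testing */
data training testing;
    set temp nobs=nobs;
    if _n_ <= .7*nobs then output training;
    else output testing;
run;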
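The "importing your CSV file" step mentioned above could look roughly like the sketch below. The file path and the data set name `imported` are placeholders I am assuming for illustration, not something given in this thread.

```sas
/* Minimal sketch: read a CSV file into a SAS data set.            */
/* The path and the output name WORK.IMPORTED are hypothetical.    */
proc import datafile="/path/to/mydata.csv"
    out=work.imported
    dbms=csv
    replace;
    getnames=yes;      /* first row holds the column names */
    guessingrows=max;  /* scan all rows when guessing column types */
run;
```

With that in place, `set sashelp.heart;` in the split code would become `set imported;`.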
Hi Linlin,
Where can I run your code? Thanks.
SAS Enterprise Miner has a code node under the UTILITY tab.
You may also notice in the Data Partition node that there are 3 types of data sets, Training, Validation and Testing. You might want to clarify what you're after.
Depending on your data set size, you may want to consider a 70-20-10 or 60-30-10 split. The Training and Validation data sets are used together to fit a model, and the Testing data set is used solely for testing the final results. If you split your data manually, you might lose some of the automated testing features built into EM, specifically how it trains and validates a model at the same time, and automatic model selection.
Hope that helps.
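For anyone who does want to make a three-way split by hand rather than with the Data Partition node, here is a minimal sketch of a 60-30-10 split in a single DATA step. The seed value and the output data set names are my own arbitrary choices, and the proportions are only approximate because each row is assigned independently.

```sas
/* Sketch: approximate 60/30/10 split into three data sets.        */
/* Seed 12345 and the names TRAIN/VALIDATE/TEST are arbitrary.     */
data train validate test;
    set sashelp.heart;                          /* stand-in for your data */
    if _n_ = 1 then call streaminit(12345);     /* fix the seed so the split is repeatable */
    u = rand("uniform");                        /* uniform draw between 0 and 1 */
    if u < 0.6 then output train;               /* about 60% of rows */
    else if u < 0.9 then output validate;       /* about 30% of rows */
    else output test;                           /* about 10% of rows */
    drop u;
run;
```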
LinLin's code can be simplified, eliminating both the temp data set and the sort (though the result will not necessarily be an exact 70:30 split):
data training testing;
    set imported;
    /* ranuni returns a value between 0 and 1, so about 70% of rows pass this test */
    if ranuni(&seed) >= 0.3 then output training;
    else output testing;
run;
If you specify a fixed positive seed (any number you like), the split is repeatable.
Why do you want to save the data as CSV? You would only need to re-import it to do your analysis.
You can put SAS code into a code segment in EG, if you have it.
Richard
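To try the single-step random split end to end, something like the following sketch should work. The seed value is arbitrary, `sashelp.heart` stands in for the imported data, and the PROC SQL step is just one way I would check how close the result is to 70:30.

```sas
%let seed = 8;   /* any positive integer; arbitrary choice */

data training testing;
    set sashelp.heart;   /* stand-in for the imported data */
    /* each row independently has about a 70% chance of going to training */
    if ranuni(&seed) >= 0.3 then output training;
    else output testing;
run;

/* Count the rows in each subset to see how close the split is to 70:30 */
proc sql;
    select 'training' as subset, count(*) as n from training
    union all
    select 'testing' as subset, count(*) as n from testing;
quit;
```

Because each row is assigned independently, the counts will be near 70:30 but usually not exact.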
What type of tool will you be using afterwards for analysis?
SAS EM has a node that will split your data, and EG also has a task that will split your data. If you're using Base SAS, then LinLin's code will work.
Hi Reeza & LinLin,
Thank you for your replies. I will be using EM. I saw the Data Partition node, but I would like to save the training and testing sets as separate CSV files. Is there a way to do that? By the way, if I haven't got Base SAS, how would I use the code that LinLin shared?
Thank you
Sorry to be joining the discussion late (I chanced upon this thread while browsing...) but as of version 13.1, EM has a Save Data node that can save your partitions in various formats including csv. You just hook up the SD node downstream of your Partition node and choose where and how to save the data.
Also, if you have EM then you definitely have all the capabilities of Base SAS.
Hope this is helpful to others who may visit this thread.
Ray
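If the Save Data node is not available, writing the two partitions to CSV from code is another option. The sketch below assumes the data sets are named `training` and `testing` in the WORK library (as in the code earlier in this thread) and uses placeholder output paths.

```sas
/* Sketch: write each partition to its own CSV file.               */
/* The data set names and both file paths are assumptions.         */
proc export data=work.training
    outfile="/path/to/training.csv"
    dbms=csv
    replace;
run;

proc export data=work.testing
    outfile="/path/to/testing.csv"
    dbms=csv
    replace;
run;
```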
If you use SAS EM, you can use the Data Partition node to do this random split.
If you use SAS EG, you can use PROC SURVEYSELECT to do this.
Both methods are simple.
Here is an example that divides a 20-observation data set into .7 and .3 proportions:
/*The following steps will allow you to divide data into random subsets with a given proportion*/
data hello;
do i= 1 to 20;
output;
end;
run;
proc surveyselect data=hello out=chao method=srs samprate=.7 outall noprint;
run; /*this step creates a new variable named "Selected" which can be referenced later on*/
data train;
set chao;
if selected = 1;
drop selected;
run;
data valid;
set chao;
if selected = 0;
drop selected;
run;
proc print data=train;
run;
proc print data=valid;
run;
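A slightly more compact variant of the PROC SURVEYSELECT example, sketched here with `sashelp.heart` as a stand-in for your data: the `seed=` option makes the selection repeatable, and one DATA step writes both subsets in a single pass. The seed value and the name `flagged` are my own choices.

```sas
/* Flag a simple random sample of 70% of the rows; OUTALL keeps every row */
/* and adds the Selected variable (1 = sampled, 0 = not sampled).          */
proc surveyselect data=sashelp.heart out=flagged method=srs samprate=0.7
                  seed=12345 outall noprint;
run;

/* Route each row to one of the two output data sets in one pass */
data train valid;
    set flagged;
    if selected = 1 then output train;
    else output valid;
    drop selected;
run;
```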