Solved
Contributor
Posts: 24

# How do I split my dataset into 70% training, 30% testing?

Dear all,

I have a dataset in CSV format. I am looking for a way or tool to randomly split it into 70% for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.

Any suggestions, methods, or guides? Could this be done in EG or EM?

Thank you.

Regards,

YL

Accepted Solutions
Solution
‎03-11-2013 12:12 AM
Super Contributor
Posts: 1,636

## Re: How do I split my dataset into 70% training, 30% testing?

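After importing your CSV file, try the code below (it uses sashelp.heart as a stand-in for your imported data):

/* Give each observation a uniform random sort key; seed 8 makes it repeatable */
data temp;
  set sashelp.heart;
  n = ranuni(8);
run;

/* Shuffle the rows by sorting on the random key */
proc sort data=temp;
  by n;
run;

/* The first 70% of the shuffled rows go to training, the rest to testing */
data training testing;
  set temp nobs=nobs;
  if _n_ <= .7*nobs then output training;
  else output testing;
run;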

All Replies

Contributor
Posts: 24

## Re: How do I split my dataset into 70% training, 30% testing?

Hi Linlin,

Where can I run your code? Thanks.

Super User
Posts: 24,004

## Re: How do I split my dataset into 70% training, 30% testing?

SAS Enterprise Miner has a code node under the UTILITY tab.

You may also notice in the Data Partition node that there are 3 types of data sets, Training, Validation and Testing. You might want to clarify what you're after.

Depending on your data set size, you may want to consider a 70-20-10 or 60-30-10 split. The Training and Validation data sets are used together to fit a model, and the Testing data set is used solely for testing the final results. If you split your data manually, you might lose some of the automated testing features built into EM, specifically how it trains and validates a model at the same time, and automatic model selection.
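
For anyone who still wants to do a three-way split in code, here is a minimal sketch of a 70-20-10 partition. It reuses the shuffle-and-cut approach from the accepted solution; the data set names (shuffled, validation) and the sort key name are my own choices for illustration, and sashelp.heart is only a stand-in for your data. Keep in mind that a manual split like this bypasses the Data Partition node and the EM features described above.

```sas
/* Attach a uniform random key to each row (seed 8 keeps it repeatable) */
data shuffled;
  set sashelp.heart;          /* stand-in for the imported data */
  _key = ranuni(8);
run;

/* Put the rows in random order */
proc sort data=shuffled;
  by _key;
run;

/* Cut the shuffled rows at 70% and 90% of the total count */
data training validation testing;
  set shuffled nobs=nobs;
  drop _key;
  if _n_ <= .7*nobs then output training;
  else if _n_ <= .9*nobs then output validation;
  else output testing;
run;
```

Changing the two cut points (.7 and .9) to .6 and .9 would give the 60-30-10 variant.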

Hope that helps.

Super Contributor
Posts: 644

## Re: How do I split my dataset into 70% training, 30% testing?

LinLin's code can be simplified, eliminating both the temp data set and the sort (but not necessarily resulting in an exact 70:30 split):

data training testing;
  set imported;
  /* ranuni returns a uniform value in (0,1), so about 70% of rows satisfy >= .3 */
  if ranuni(&seed) >= .3 then output training;
  else output testing;
run;

If you set the macro variable seed to any positive integer you like, the division is repeatable.
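
To illustrate both points (repeatability and the inexact split), here is a minimal sketch. The seed value 12345, the data set name split, and the variable name part are hypothetical, and sashelp.class stands in for the imported data. It labels each row instead of writing two data sets, so the realized proportions can be checked with proc freq.

```sas
/* Hypothetical seed; any positive integer gives a repeatable split */
%let seed = 12345;

data split;
  set sashelp.class;          /* stand-in for the imported data */
  length part $8;
  /* Each row independently has a 70% chance of landing in training */
  if ranuni(&seed) >= .3 then part = 'training';
  else part = 'testing';
run;

/* Show how close the realized proportions are to 70/30 */
proc freq data=split;
  tables part;
run;
```

On a small data set the percentages can be noticeably off from 70/30; on a large one they should be close.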

Why do you want to save the data as a CSV? You would only need to reimport it to do your analysis.

You can put SAS code into a code segment in EG, if you have it.

Richard

Super User
Posts: 24,004

## Re: How do I split my dataset into 70% training, 30% testing?

What type of tool will you be using afterwards for analysis?

SAS EM has a node that will split your data, and EG also has a task that will split your data. If you're using Base SAS, then LinLin's code will work.

Contributor
Posts: 24

## Re: How do I split my dataset into 70% training, 30% testing?

Hi Reeza & LinLin,

Thank you for your replies. I will be using EM. I saw the Data Partition node, but I would like to save the training and testing sets as separate CSV files. Is there a way to do it? By the way, if I haven't got Base SAS, how would I use the code that LinLin shared?

Thank you.

Posts: 3,867

## Re: How do I split my dataset into 70% training, 30% testing?

proc surveyselect data=sashelp.class rate=.3 outall out=class2;
run;

The variable Selected is 1 for the 30% sample and 0 for the other 70%.

SAS Employee
Posts: 106

## Re: How do I split my dataset into 70% training, 30% testing?

Sorry to be joining the discussion late (I chanced upon this thread while browsing...) but as of version 13.1, EM has a Save Data node that can save your partitions in various formats including csv. You just hook up the SD node downstream of your Partition node and choose where and how to save the data.

Also, if you have EM then you definitely have all the capabilities of Base SAS.
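
Since Base SAS is available, the partitions could also be written out in code, for example from a SAS Code node. This is only a sketch: it assumes data sets named training and testing already exist in the WORK library, and the output paths are placeholders you would need to replace.

```sas
/* Assumes WORK.TRAINING and WORK.TESTING already exist */
/* The output paths are hypothetical; point them at a folder you can write to */
proc export data=training
  outfile="/path/to/training.csv"
  dbms=csv
  replace;                    /* overwrite the file if it already exists */
run;

proc export data=testing
  outfile="/path/to/testing.csv"
  dbms=csv
  replace;
run;
```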

Ray

New Contributor
Posts: 2

## Re: How do I split my dataset into 70% training, 30% testing?

If you use SAS EM, you can use the Data Partition node to do this random split.

If you use SAS EG, you can use proc surveyselect to do this work.

Both methods are simple and easy.

New Contributor
Posts: 3

## Re: How do I split my dataset into 70% training, 30% testing?

Here is an example that divides a 20-observation data set into 70/30 proportions:

/*The following steps will allow you to divide data into multiple random subsets with a given proportion*/

data hello;
do i= 1 to 20;
output;
end;
run;

proc surveyselect data=hello out=chao method=srs samprate=.7 outall noprint;
run; /*this step creates a new variable named "Selected" which can be referenced later on*/

data train;
set chao;
if selected = 1;
drop selected;
run;

data valid;
set chao;
if selected = 0;
drop selected;
run;

proc print data=train;
run;

proc print data=valid;
run;
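
One caveat: as written, the proc surveyselect step above has no seed, so it may draw a different sample on each run. A minimal variation, assuming you want a repeatable split, adds the seed= option. The value 12345 is arbitrary and only for illustration; the data set names are the ones from the example above.

```sas
/* Same selection as above, but repeatable because the seed is fixed */
/* 12345 is an arbitrary, hypothetical seed value */
proc surveyselect data=hello out=chao method=srs samprate=.7
                  seed=12345 outall noprint;
run;
```

The train and valid data steps that follow can stay exactly as they are.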

☑ This topic is solved.