How do i split my dataset into 70% training , 30% testing ?

Contributor
Posts: 24

Dear all,

I have a dataset in csv format. I am looking for a way or tool to randomly split it, with 70% of the data for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70% - 30% because it seems to be a common rule of thumb.

Any suggestions, methods or guides? Could I use EG or EM for this?

Thank you.


Regards,

YL


Accepted Solutions
Solution
‎03-11-2013 12:12 AM
Super Contributor
Posts: 1,636

Re: How do i split my dataset into 70% training , 30% testing ?

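After importing your csv file, try the code below:

/* attach a random number to every observation */
data temp;
   set sashelp.heart;   /* sample table; use your imported data set here */
   n=ranuni(8);
run;

/* sorting by the random number shuffles the rows */
proc sort data=temp;
   by n;
run;

/* the first 70% of the shuffled rows go to training, the rest to testing */
data training testing;
   set temp nobs=nobs;
   if _n_<=.7*nobs then output training;
   else output testing;
run;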
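
As a rough sketch of the import step mentioned above, PROC IMPORT can read the csv into a SAS data set first. The file path and the data set name imported are placeholders, not something from the original post; substitute your own.

```sas
/* Hypothetical path and data set name: replace with your own */
proc import datafile="C:\mydata\dataset.csv"
            out=work.imported
            dbms=csv
            replace;
   getnames=yes;   /* take variable names from the first row of the file */
run;
```

You would then use imported in place of sashelp.heart in the first data step.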



All Replies

Contributor
Posts: 24

Re: How do i split my dataset into 70% training , 30% testing ?

Hi Linlin,

Where could I use your code? Thanks.

Grand Advisor
Posts: 17,360

Re: How do i split my dataset into 70% training , 30% testing ?

SAS Enterprise Miner has a SAS Code node under the Utility tab.

You may also notice in the Data Partition node that there are 3 types of data sets, Training, Validation and Testing. You might want to clarify what you're after.

Depending on your data set size, you may want to consider a 70-20-10 split or a 60-30-10 split. The Training and Validation datasets are used together to fit a model, and the Testing dataset is used solely for testing the final results. If you split your data manually, you might lose some of the automated testing features built into EM, specifically how it trains and validates a model at the same time, and automatic model selection.
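
If you do split manually outside EM, a 70-20-10 split could be sketched as below by extending the sort-on-a-random-number approach posted earlier in this thread. The table sashelp.heart, the seed 8 and the data set names are illustrative assumptions only.

```sas
/* attach a random number to every observation */
data temp;
   set sashelp.heart;   /* sample table standing in for your own data */
   n = ranuni(8);
run;

/* shuffle the rows by sorting on the random number */
proc sort data=temp;
   by n;
run;

/* first 70% to training, next 20% to validation, last 10% to testing */
data training validation testing;
   set temp nobs=nobs;
   if _n_ <= 0.7*nobs then output training;
   else if _n_ <= 0.9*nobs then output validation;
   else output testing;
   drop n;
run;
```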

Hope that helps.

Super Contributor
Posts: 644

Re: How do i split my dataset into 70% training , 30% testing ?

LinLin's code can be simplified, eliminating both the temp data set and the sort (but not necessarily resulting in an exact 70:30 split) :

  data training testing;
   set imported;
   /* ranuni returns a value between 0 and 1, so about 70% of rows go to training */
   if ranuni(&seed) >= .3 then output training;
   else output testing;
  run;

Set the macro variable seed to any positive integer you like and the division is repeatable.
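
On newer SAS releases the same idea can be written with the RAND function, which the SAS documentation generally recommends over the older RANUNI. A minimal sketch, assuming the data set is called imported as above and using an arbitrary seed value:

```sas
%let seed = 12345;   /* arbitrary value chosen for illustration */

data training testing;
   set imported;                               /* assumed name of the imported data set */
   if _n_ = 1 then call streaminit(&seed);     /* seed the RAND stream once */
   if rand('uniform') < 0.7 then output training;   /* about 70% of rows */
   else output testing;
run;
```

As with the RANUNI version, the split is only approximately 70:30 because each row is assigned independently.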

Why do you want to save the data as a csv? You would only need to reimport it to do your analysis.

You can put SAS code into a code segment in EG, if you have it.

Richard

New Contributor
Posts: 2

Re: How do i split my dataset into 70% training , 30% testing ?

Great answer!

Grand Advisor
Posts: 17,360

Re: How do i split my dataset into 70% training , 30% testing ?

What type of tool will you be using afterwards for analysis?

SAS EM has a node that will split your data, and EG also has a task that will split your data. If you're using Base SAS, then LinLin's code will work.

Contributor
Posts: 24

Re: How do i split my dataset into 70% training , 30% testing ?

Hi Reeza & LinLin,


Thank you for your reply. I will be using EM. I saw the Data Partition node but would like to save the training and testing sets as separate csv files. Is there a way to do it? By the way, if I haven't got Base SAS, how would I use the code which LinLin shared?


Thank you

Respected Advisor
Posts: 3,775

Re: How do i split my dataset into 70% training , 30% testing ?

proc surveyselect data=sashelp.class rate=.3 outall out=class2;
   run;
The variable Selected is 1 for the 30% sample and 0 for the other 70%.
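
To turn that flag into two separate data sets, something like the following might work. The seed= value and the output data set names are illustrative choices, not part of the code above; seed= simply makes the sample repeatable.

```sas
/* flag a 30% simple random sample; outall keeps the unselected rows too */
proc surveyselect data=sashelp.class samprate=0.3 seed=12345
                  outall out=class2 noprint;
run;

/* route the rows by the Selected flag */
data training testing;
   set class2;
   if selected = 1 then output testing;   /* the 30% sample */
   else output training;                  /* the remaining 70% */
   drop selected;
run;
```
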
SAS Employee
Posts: 106

Re: How do i split my dataset into 70% training , 30% testing ?

Sorry to be joining the discussion late (I chanced upon this thread while browsing), but as of version 13.1, EM has a Save Data node that can save your partitions in various formats, including csv. You just hook up the Save Data node downstream of your Partition node and choose where and how to save the data.

 

Also, if you have EM then you definitely have all the capabilities of Base SAS.
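
For example, without the Save Data node, a SAS Code node or any Base SAS session could write the two partitions out with PROC EXPORT. This is only a sketch: the data set names training and testing and the output paths are assumptions.

```sas
/* Hypothetical output paths: adjust to a folder the SAS session can write to */
proc export data=training
            outfile="C:\mydata\training.csv"
            dbms=csv
            replace;
run;

proc export data=testing
            outfile="C:\mydata\testing.csv"
            dbms=csv
            replace;
run;
```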

 

Hope this is helpful to others who may visit this thread.

 

Ray

New Contributor HRI
New Contributor
Posts: 2

Re: How do i split my dataset into 70% training , 30% testing ?

If you use SAS EM, you can use the Data Partition node to do this random split.

If you use SAS EG, you can use proc surveyselect to do this work.

Both methods are simple and easy.
