cody_q
Calcite | Level 5

How do I split my dataset into 70% training and 30% testing?

Dear all,

I have a dataset in csv format. I am looking for a way or tool to randomly split it into 70% for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.

Any suggestions, methods, or guides? Or could I use EG or EM?

Thank you.


Regards,

YL

1 ACCEPTED SOLUTION

Linlin
Lapis Lazuli | Level 10

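After importing your csv file, try the code below:

data temp;
  set sashelp.heart;  /* example data; use your imported data set here */
  n = ranuni(8);      /* random sort key; the fixed seed 8 makes it repeatable */
run;

proc sort data=temp;
  by n;               /* sorting by the random key shuffles the rows */
run;

data training testing;
  set temp nobs=nobs;
  /* the first 70% of the shuffled rows go to training, the rest to testing */
  if _n_ <= .7*nobs then output training;
  else output testing;
run;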
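
The import step mentioned above might look like the following. This is only a minimal sketch: the file path and the output data set name `imported` are placeholders, not something given in this thread.

```sas
/* Sketch only: the path and the name "imported" are assumptions */
proc import datafile="/path/to/mydata.csv"
            out=imported       /* data set created in the WORK library */
            dbms=csv
            replace;           /* overwrite "imported" if it already exists */
  getnames=yes;                /* read variable names from the first row */
run;
```

You would then point the `set` statement at `imported` instead of `sashelp.heart`.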

11 REPLIES
cody_q
Calcite | Level 5

Hi Linlin,

Where can I run your code? Thanks.

Reeza
Super User

SAS Enterprise Miner has a code node under the UTILITY tab.

You may also notice in the Data Partition node that there are 3 types of data sets, Training, Validation and Testing. You might want to clarify what you're after.

Depending on your data set size, you may want to consider a 70-20-10 or 60-30-10 split. The Training and Validation data sets are used together to fit a model, and the Testing data set is used solely for testing the final results. If you split your data manually, you might lose some of the automated features built into EM, specifically how it trains and validates a model at the same time, and automatic model selection.
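
If you do want to make a three-way split by hand, a rough sketch of a 70-20-10 partition in a DATA step could look like this. The data set name `imported` and the seed value are assumptions for illustration, and the proportions are only approximate because each row is assigned independently.

```sas
%let seed = 12345;  /* hypothetical seed; any positive integer gives a repeatable split */

data training validation testing;
  set imported;                           /* assumed name of the imported data set */
  u = ranuni(&seed);                      /* one uniform(0,1) draw per observation */
  if u < .7 then output training;         /* roughly 70% of rows */
  else if u < .9 then output validation;  /* roughly 20% of rows */
  else output testing;                    /* roughly 10% of rows */
  drop u;                                 /* the random draw is not needed afterwards */
run;
```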

Hope that helps.

RichardinOz
Quartz | Level 8

LinLin's code can be simplified, eliminating both the temp data set and the sort (but not necessarily resulting in an exact 70:30 split):

  data training testing;
    set imported;
    /* ranuni returns a value between 0 and 1, so about 70% of rows pass this test */
    if ranuni(&seed) >= .3 then output training;
    else output testing;
  run;

If you set the macro variable seed to any positive number you like, the division is repeatable.

Why do you want to save the data as a csv? You would only need to reimport it to do your analysis.

You can put SAS code into a code segment in EG, if you have it.

Richard

Reeza
Super User

What type of tool will you be using afterwards for analysis?

SAS EM has a node that will split your data, and EG also has a task that will split your data. If you're using Base SAS then LinLin's code will work.

cody_q
Calcite | Level 5

Hi Reeza & LinLin,


Thank you for your reply. I will be using EM. I saw the Data Partition node but would like to save the training and testing sets as separate csv files. Is there a way to do it? By the way, if I haven't got Base SAS, how would I use the code which Linlin shared?


Thank you

data_null__
Jade | Level 19
proc surveyselect data=sashelp.class rate=.3 outall out=class2;
   run;
The variable Selected is 1 for the 30% sample and 0 for the other 70%.
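
As a possible follow-up, the Selected flag can be used to write the two subsets in a single DATA step. This is a sketch only: the seed value and the output names `training` and `testing` are assumptions, and `seed=` is added just to make the sample repeatable.

```sas
/* Same selection as above, with a fixed (hypothetical) seed for repeatability */
proc surveyselect data=sashelp.class rate=.3 outall out=class2
                  seed=12345 noprint;
run;

data training testing;
  set class2;
  if selected = 1 then output testing;  /* the 30% sample */
  else output training;                 /* the remaining 70% */
  drop selected;
run;
```
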
rayIII
SAS Employee

Sorry to be joining the discussion late (I chanced upon this thread while browsing...) but as of version 13.1, EM has a Save Data node that can save your partitions in various formats including csv. You just hook up the SD node downstream of your Partition node and choose where and how to save the data. 

 

Also, if you have EM then you definitely have all the capabilities of Base SAS.
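
For EM versions without the Save Data node, one option is to export the partitions with code instead. The sketch below assumes the two data sets are named `training` and `testing` and uses placeholder file paths; neither comes from this thread.

```sas
/* Sketch only: data set names and output paths are assumptions */
proc export data=training
            outfile="/path/to/training.csv"
            dbms=csv
            replace;   /* overwrite the file if it already exists */
run;

proc export data=testing
            outfile="/path/to/testing.csv"
            dbms=csv
            replace;
run;
```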

 

Hope this is helpful to others who may visit this thread.

 

Ray

HRI
Calcite | Level 5

If you use SAS EM, you can use the Data Partition node to do this random split.

If you use SAS EG, you can use PROC SURVEYSELECT to do this work.

Both methods are simple and easy.

wolfpmd3
Fluorite | Level 6

Here is an example that divides a 20-observation data set into 70/30 proportions:

 

/* The following steps will allow you to divide data into multiple random subsets with a given proportion */

data hello;
do i= 1 to 20;
output;
end;
run;

proc surveyselect data=hello out=chao method=srs samprate=.7 outall noprint;
run; /* this step creates a new variable named "Selected" which can be referenced later on */

data train;
set chao;
if selected = 1;
drop selected;
run;

data valid;
set chao;
if selected = 0;
drop selected;
run;

proc print data=train;
run;

proc print data=valid;
run;

Discussion stats
  • 11 replies
  • 53844 views
  • 18 likes
  • 9 in conversation