topic How do i split my dataset into 70% training , 30% testing ? in SAS Data Science

How do i split my dataset into 70% training , 30% testing ?

cody_q — Mon, 11 Mar 2013 02:27:05 GMT

Dear all,

I have a dataset in CSV format. I am looking for a way or tool to split it randomly, with 70% of the data for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.

Any suggestions, methods, or guides? Could this be done in EG (Enterprise Guide) or EM (Enterprise Miner)?

Thank you.

Regards,

Re: How do i split my dataset into 70% training , 30% testing ?

PGStats — Mon, 11 Mar 2013 02:53:12 GMT

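Simply add

if ranuni(0) < 0.7 then set = "TRAINING";
else set = "TESTING";

to create a new variable as you read your dataset. RANUNI requires a seed argument: 0 seeds from the system clock, while a positive integer gives a reproducible split.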

Re: How do i split my dataset into 70% training , 30% testing ?

cody_q — Mon, 11 Mar 2013 10:20:17 GMT

Hi PGStats,

How could I use the above code to create a new variable?

Thanks

Re: How do i split my dataset into 70% training , 30% testing ?

PGStats — Mon, 11 Mar 2013 14:01:22 GMT

Those statements would be added to a DATA step to create a new character variable called set. Each observation is assigned the value TRAINING with probability 0.7 and the value TESTING otherwise, so roughly 70% of observations end up in the training group.
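As a minimal sketch of where those statements could go, assuming the CSV has a header row and is first imported into a data set: the names work.mydata and work.split, the file path, and the seed 12345 are placeholders chosen for illustration, not anything from the original question.

```sas
/* Assumption: the CSV has a header row; the path is a placeholder. */
proc import datafile="path to existing csv file"
            out=work.mydata          /* hypothetical data set name */
            dbms=csv
            replace;
    getnames=yes;
run;

data work.split;                     /* hypothetical output name */
    set work.mydata;                 /* SET statement: read each observation */
    /* A fixed positive seed (12345 is arbitrary) makes the split reproducible. */
    /* Here "set" is also used as a variable name, as in the reply above.       */
    if ranuni(12345) < 0.7 then set = "TRAINING";
    else set = "TESTING";
run;

/* Check the proportions actually obtained. */
proc freq data=work.split;
    tables set;
run;
```

The PROC FREQ step is only a check: because each observation is assigned independently, the observed proportions will be close to 70/30 but usually not exactly that.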

Re: How do i split my dataset into 70% training , 30% testing ?

jaredp — Tue, 09 Apr 2013 15:22:06 GMT

Well, if you have EM, then splitting the data into training and testing sets is trivial. Partitioning is a default feature when creating your SAS data in EM. You can also use a Data Partition node.

Re: How do i split my dataset into 70% training , 30% testing ?

Astounding — Tue, 09 Apr 2013 16:02:01 GMT

If you're really interested in splitting a csv file into two csv files, there is no need to create a SAS data set along the way. Here's one approach:

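filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';

/* Note: a header row, if present, is routed like any other line. */
data _null_;
    infile csvfile;
    input @;                                 /* load the next line into _infile_ */
    if ranuni(12345) < 0.7 then file train;  /* about 70% of lines go to the training file */
    else file test;                          /* the rest go to the testing file */
    put _infile_;                            /* write the line unchanged */
run;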

The drawback is that you will get approximately 70/30, not exact, because each line is assigned independently. If you really want to create a SAS data set from the CSV file first, there are many alternatives, including PROC SURVEYSELECT.
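For a split that is as close to 70/30 as the row count allows, a PROC SURVEYSELECT approach might look like the sketch below. It assumes the CSV has already been read into a data set; work.mydata, work.split, work.train, work.test, and the seed are hypothetical names and values used only for illustration.

```sas
/* Assumption: work.mydata already holds the imported CSV data. */
proc surveyselect data=work.mydata
                  out=work.split     /* hypothetical output name */
                  method=srs         /* simple random sampling without replacement */
                  samprate=0.7       /* select 70% of the observations */
                  seed=12345         /* arbitrary seed, for reproducibility */
                  outall;            /* keep all rows and add a Selected flag (1/0) */
run;

/* Route selected rows to training and the rest to testing. */
data work.train work.test;
    set work.split;
    if Selected = 1 then output work.train;
    else output work.test;
run;
```

With OUTALL, the output keeps every observation and marks the sampled ones with Selected = 1, so a single DATA step can write both subsets.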

Good luck.