Dear all,
I have a dataset in CSV format. I am looking for a way or tool to randomly split it into 70% for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.
Any suggestions, methods, or guides? Could this be done with EG or EM?
Thank you.
Regards,
YL
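Simply add
if ranuni(12345) < 0.7 then set = "TRAINING";
else set = "TESTING";
to create a new variable as you read your dataset. (RANUNI needs a seed argument: a positive seed gives a reproducible split, while 0 seeds from the system clock.)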
PG
Hi PGStats,
How could I use the above code to create a new variable?
Thanks
Those statements would be added to a DATA step to create a new character variable called set, which would randomly take the value TRAINING for about 70% of observations and the value TESTING otherwise.
PG
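To show where those statements sit, here is a minimal sketch of a complete DATA step. It assumes the CSV has already been read into a data set; the names WORK.MYDATA, WORK.MYDATA_SPLIT and the variable name partition are hypothetical (partition is used instead of set only to avoid confusion with the SET statement), and the seed 12345 is arbitrary.

```sas
/* Hypothetical input: WORK.MYDATA, a data set already read from the CSV file */
data work.mydata_split;
    set work.mydata;                     /* read each observation */
    length partition $8;                 /* long enough to hold "TRAINING" */
    /* RANUNI returns a uniform random number between 0 and 1;
       a positive seed makes the assignment reproducible */
    if ranuni(12345) < 0.7 then partition = "TRAINING";
    else partition = "TESTING";
run;

/* Check how close the realized split is to 70/30 */
proc freq data=work.mydata_split;
    tables partition;
run;
```

Because each observation is assigned independently, the PROC FREQ counts will be close to 70/30 but usually not exact.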
Well, if you have EM, then splitting the data into training and testing sets is trivial. It is a default feature when creating your SAS data in EM. You can also use a Data Partition node.
If you're really interested in splitting a csv file into two csv files, there is no need to create a SAS data set along the way. Here's one approach:
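filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';
data _null_;
   infile csvfile;
   input @;          /* load the next line into _infile_ without parsing it */
   /* the fixed seed 12345 makes the split reproducible */
   /* note: a header row, if present, is routed at random like any other line */
   if ranuni(12345) < 0.7 then file train;
   else file test;
   put _infile_;     /* copy the line unchanged to the selected file */
run;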
The drawback is that you will get approximately 70/30, not exact. If you really want to create a SAS data set from the csv file first, there are many alternatives including PROC SURVEYSELECT.
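As a rough sketch of the PROC SURVEYSELECT alternative, assuming the CSV has already been imported into a data set: the names WORK.MYDATA, WORK.MYDATA_SPLIT, WORK.TRAIN and WORK.TEST are hypothetical, and the seed is arbitrary. With simple random sampling and SAMPRATE=0.7, the procedure selects 70% of the observations (up to rounding) rather than an approximate share.

```sas
/* Hypothetical input: WORK.MYDATA, a data set created from the CSV file */
proc surveyselect data=work.mydata
                  out=work.mydata_split
                  method=srs        /* simple random sampling without replacement */
                  samprate=0.7      /* select 70% of the observations */
                  seed=12345        /* arbitrary seed for a reproducible sample */
                  outall;           /* keep all rows and add a Selected flag (1/0) */
run;

/* Route selected rows to the training set and the rest to the testing set */
data work.train work.test;
    set work.mydata_split;
    if Selected = 1 then output work.train;
    else output work.test;
run;
```

The OUTALL option is what keeps the unselected 30% in the output, so a single pass can produce both subsets.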
Good luck.