Dear all,
I have a dataset in csv format. I am looking for a way or tool to randomly split it into 70% for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.
Any suggestions, methods, or guides? Could this be done in EG or EM?
Thank you.
Regards,
YL
Simply add
if ranuni(12345) < 0.7 then set="TRAINING";
else set = "TESTING";
to create a new variable as you read your dataset. (ranuni needs a seed argument; any positive integer makes the split reproducible.)
PG
Hi PGStats,
How could I use the above code to create a new variable?
Thanks
Those statements would be added to a data step to create a new character variable called set. It takes the value TRAINING at random for approximately 70% of observations and the value TESTING otherwise.
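For illustration, here is a minimal sketch of such a data step. The data set names work.mydata and work.split are hypothetical, and it assumes the csv has already been read into work.mydata. The seed 12345 is arbitrary.

```sas
/* Sketch only: work.mydata and work.split are hypothetical names */
data work.split;
   set work.mydata;                         /* read each observation */
   /* assign each observation to a subset at random */
   if ranuni(12345) < 0.7 then set = "TRAINING";
   else set = "TESTING";
run;

/* check how close the split came to 70/30 */
proc freq data=work.split;
   tables set;
run;
```

The PROC FREQ step is only there as a check; the percentages it reports should be near 70 and 30 but will rarely match them exactly.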
 
PG
Well, if you have EM, then splitting the data into Training and Testing is trivial. Partitioning is offered by default when you create your SAS data in EM. You can also use a Data Partition node.
If you're really interested in splitting a csv file into two csv files, there is no need to create a SAS data set along the way. Here's one approach:
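filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';
data _null_;
   infile csvfile;
   input @;                 /* load the line into the buffer without parsing it */
   if ranuni(12345) < 0.7 then file train;   /* about 70% of lines */
   else file test;                           /* the remaining lines */
   put _infile_;            /* write the line unchanged to the chosen file */
run;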
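If the csv file starts with a header row, the step above sends that row to only one of the two files. A possible variant, assuming the first line is a header (the original post does not say whether it is), copies it to both outputs:

```sas
filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';
data _null_;
   infile csvfile;
   input;                          /* read one line into the buffer */
   if _n_ = 1 then do;             /* assumption: line 1 is a header row */
      file train; put _infile_;    /* copy the header to the training file */
      file test;  put _infile_;    /* and to the testing file */
   end;
   else do;
      if ranuni(12345) < 0.7 then file train;
      else file test;
      put _infile_;                /* write the data line to the chosen file */
   end;
run;
```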
The drawback is that the split will be approximately 70/30, not exact, because each line is assigned independently. If you would rather create a SAS data set from the csv file first, there are many alternatives, including PROC SURVEYSELECT.
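As a rough sketch of the PROC SURVEYSELECT route, the steps below import the csv, draw a simple random sample of 70% of the rows, and split on the Selected flag that the OUTALL option adds. The data set names work.full, work.split, work.train and work.test are hypothetical, and the seed is arbitrary.

```sas
/* Sketch only: the work.* data set names are hypothetical */
proc import datafile='path to existing csv file'
            out=work.full dbms=csv replace;
   getnames=yes;                   /* assumption: first row holds column names */
run;

/* simple random sample of 70% of the rows; OUTALL keeps every row */
/* and adds a 0/1 variable named Selected                          */
proc surveyselect data=work.full out=work.split
                  method=srs samprate=0.7 seed=12345 outall;
run;

data work.train work.test;
   set work.split;
   if selected then output work.train;   /* sampled rows     */
   else output work.test;                /* all other rows   */
   drop selected;
run;
```

Because the sample size is fixed in advance, this gives a 70/30 split up to rounding to a whole number of rows, rather than the approximate split from the ranuni approach.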
Good luck.
