cody_q
Calcite | Level 5

Dear all,

I have a dataset in CSV format. I am looking for a way or tool to randomly split it into 70% for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.

Any suggestions, methods, or guides? Could this be done in EG or EM?

Thank you.


Regards,

YL

1 ACCEPTED SOLUTION

PGStats
Opal | Level 21

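Simply add

if ranuni(0) < 0.7 then set = "TRAINING";

else set = "TESTING";

to create a new variable as you read your dataset. RANUNI needs a seed argument: 0 seeds from the system clock, while a positive integer such as 12345 makes the split reproducible.

PG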

cody_q
Calcite | Level 5

Hi PGStats,

How could I use the above code to create the new variable?

Thanks

PGStats
Opal | Level 21

Those statements would be added to a DATA step to create a new character variable called set that takes the value TRAINING at random for about 70% of observations and the value TESTING otherwise.

PG
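For illustration, here is a minimal sketch of a complete DATA step built around those two statements. It assumes the CSV has a header row and is read with PROC IMPORT first. The dataset names work.full and work.flagged and the seed 12345 are placeholders, not anything from the original post, and the variable is called part rather than set purely to avoid confusion with the SET statement.

```sas
/* Placeholder path and dataset names; adjust to your environment. */
proc import datafile="path to existing csv file"
    out=work.full
    dbms=csv
    replace;
    getnames=yes;   /* assumption: first line holds column names */
run;

data work.flagged;
    set work.full;
    length part $8;                                  /* room for "TRAINING" */
    if ranuni(12345) < 0.7 then part = "TRAINING";   /* roughly 70% of rows */
    else part = "TESTING";                           /* the remaining rows  */
run;
```

A later step can then pick one subset with a WHERE clause such as where part = "TRAINING".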
jaredp
Quartz | Level 8

Well, if you have EM, then splitting the data into training and testing sets is trivial.  Partitioning is offered by default when you create your SAS data in EM.  You can also use a Data Partition node.

Astounding
PROC Star

If you're really interested in splitting a csv file into two csv files, there is no need to create a SAS data set along the way.  Here's one approach:

filename csvfile 'path to existing csv file';

filename train 'path to a training subset';

filename test 'path to a testing subset';

data _null_;

  infile csvfile;

  input @;

  if ranuni(12345) < 0.7 then file train;

  else file test;

  put _infile_;

run;
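One caveat with the step above: if the csv file starts with a header row, that row is treated like any other line and lands in only one of the two output files. A possible variant, assuming line 1 really is a header, copies it to both files before splitting the rest. If your file has no header, this would duplicate the first data row, so use the original step instead.

```sas
filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';

data _null_;
  infile csvfile;
  input;                        /* load the current line into _infile_ */
  if _n_ = 1 then do;
    /* Assumption: line 1 is a header row, so write it to both files */
    file train;
    put _infile_;
    file test;
    put _infile_;
  end;
  else do;
    /* Every other line goes to exactly one file, about 70/30 */
    if ranuni(12345) < 0.7 then file train;
    else file test;
    put _infile_;
  end;
run;
```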

The drawback of this random approach is that the split is only approximately 70/30, not exact.  If you really want to create a SAS data set from the csv file first, there are many alternatives, including PROC SURVEYSELECT.
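As a rough sketch of the PROC SURVEYSELECT route, the steps below draw a simple random sample of 70% of the rows (rounded to a whole number of observations) and then route the rows into two data sets. The names work.full, work.split, work.train and work.test and the seed are placeholders, and the import step assumes the csv file has a header row.

```sas
/* Placeholder path and dataset names; adjust to your environment. */
proc import datafile="path to existing csv file"
    out=work.full
    dbms=csv
    replace;
    getnames=yes;   /* assumption: first line holds column names */
run;

/* OUTALL keeps every row and adds Selected = 1 (sampled) or 0 (not sampled) */
proc surveyselect data=work.full
    out=work.split
    method=srs
    samprate=0.7
    seed=12345
    outall;
run;

data work.train work.test;
    set work.split;
    if Selected = 1 then output work.train;   /* the 70% sample    */
    else output work.test;                    /* the remaining 30% */
run;
```

Because the sample size is fixed up front, rerunning with the same seed should give the same split.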

Good luck.
