cody_q
Calcite | Level 5

Dear all,

I have a dataset in CSV format. I am looking for a way or tool to split it at random, with 70% of the records used for training and 30% for testing, so that both subsets are random samples from the same distribution. I chose 70/30 because it seems to be a common rule of thumb.

Any suggestions, methods, or guides? Could this be done in Enterprise Guide (EG) or Enterprise Miner (EM)?

Thank you.


Regards,

YL

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Simply add

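if ranuni(12345) < 0.7 then set = "TRAINING";

else set = "TESTING";

to create a new variable as you read your dataset. RANUNI needs a seed argument; a positive integer such as 12345 makes the split reproducible.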

PG


cody_q
Calcite | Level 5

Hi PGStats,


How could I use the above code to create a new variable?

Thanks

PGStats
Opal | Level 21

Those statements would be added to a DATA step to create a new character variable called set. For each observation, it takes the value TRAINING with probability 0.7 and the value TESTING otherwise, so roughly 70% of observations end up in the training group.
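
For illustration, here is a minimal sketch of where those statements could go. It assumes the CSV has a header row and is first read with PROC IMPORT; the file path and the data set names work.mydata and work.mydata_split are placeholders, not anything from the original question.

```sas
/* Read the CSV into a SAS data set (path and names are hypothetical). */
proc import datafile='path to existing csv file'
    out=work.mydata dbms=csv replace;
  getnames=yes;   /* assumes the first row holds column names */
run;

/* Add the random TRAINING/TESTING flag to every observation. */
data work.mydata_split;
  set work.mydata;
  length set $8;                       /* long enough for "TRAINING" */
  if ranuni(12345) < 0.7 then set = "TRAINING";
  else set = "TESTING";
run;

/* Check the resulting proportions. */
proc freq data=work.mydata_split;
  tables set;
run;
```

The PROC FREQ step is only a check: the percentages should be close to 70 and 30, but they will rarely match exactly, because each observation is assigned independently.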

PG
jaredp
Quartz | Level 8

Well, if you have EM, then splitting the data into training and testing sets is trivial. Partitioning is offered by default when you create your SAS data in EM. You can also use a Data Partition node.

Astounding
PROC Star

If you're really interested in splitting a CSV file into two CSV files, there is no need to create a SAS data set along the way. Here's one approach:

filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';

data _null_;
  infile csvfile;
  input @;                               /* load the raw line into _infile_ without parsing it */
  if ranuni(12345) < 0.7 then file train; /* about 70% of lines go to the training file */
  else file test;                        /* the rest go to the testing file */
  put _infile_;                          /* write the line unchanged */
run;
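
One thing to watch with this approach: if the CSV starts with a header row, that row is treated like any other line and lands in only one of the two output files. A possible variant, sketched under the assumption that the first line is a header, copies it to both files:

```sas
filename csvfile 'path to existing csv file';
filename train 'path to a training subset';
filename test 'path to a testing subset';

data _null_;
  infile csvfile;
  input;                        /* read the raw line into _infile_ */
  if _n_ = 1 then do;
    /* assumption: line 1 is a header row, so write it to both outputs */
    file train;
    put _infile_;
    file test;
    put _infile_;
  end;
  else do;
    if ranuni(12345) < 0.7 then file train;
    else file test;
    put _infile_;
  end;
run;
```

If the file has no header row, the original version is the one to use.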

The drawback is that the split will be approximately 70/30, not exactly 70/30, because each line is assigned independently. If you would rather create a SAS data set from the CSV file first, there are many alternatives, including PROC SURVEYSELECT.
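
As a rough sketch of the PROC SURVEYSELECT route: assuming the CSV has already been read into a data set (called work.mydata here, a placeholder name), simple random sampling with a 70% rate and the OUTALL option keeps every observation and flags the sampled ones in a variable named Selected. The output data set names and the seed are also illustrative.

```sas
/* Flag a simple random sample of 70% of the observations. */
proc surveyselect data=work.mydata out=work.flagged
    method=srs samprate=0.7 seed=12345 outall;
run;

/* Route flagged observations to training, the rest to testing. */
data work.train work.test;
  set work.flagged;
  if selected = 1 then output work.train;
  else output work.test;
  drop selected;
run;
```

Unlike the RANUNI approach, the sample size here is fixed by the sampling rate, so the training set holds 70% of the observations up to rounding.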

Good luck.
