Help using Base SAS procedures

How to split data into train and test sets, and use the model built from train set to predict data in test set?

New Contributor

Hi! I am a junior SAS analyst.

I want to split my data into training and test sets, and then use the model built on the training set to predict the data in the test set. The data set has 50,000 or more observations.

The easiest way I can think of is to use PROC SURVEYSELECT to draw a random sample of observations from the whole data set. For example, I can ask SAS to randomly sample 30% as the test set (leaving the remaining 70% as the training set):

/* Draw a 30% simple random sample; OUT= receives only the sampled rows */
PROC SURVEYSELECT DATA=whole.data OUT=test.set METHOD=srs SAMPRATE=0.3;
RUN;

Now I have a test set in the data set 'test.set'. However:

1. How can I create a data set (e.g. 'train.set') to hold the remaining 70% of the data?

2. After using 'train.set' to build a predictive model (e.g. a linear model), how can I use that model to predict the data in 'test.set', and have the output show every predicted value and residual?

Thanks for your patience!

David


All Replies
Trusted Advisor
Posts: 1,204

Hi,

Just add the OUTALL option to the PROC SURVEYSELECT statement. The output data set (named "all" here) then contains every observation from the input data plus a flag variable "Selected", which is 1 for the sampled observations (the test sample) and 0 for the remaining observations, which can serve as the training set. So you can use selected=0 as the training data for model development and selected=1 for testing.

/* OUTALL keeps all rows in OUT= and adds the 0/1 flag variable Selected */
PROC SURVEYSELECT DATA=whole.data OUTALL OUT=all METHOD=srs SAMPRATE=0.3;
RUN;
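
If separate data sets are preferred over filtering on the flag, the split described above could be sketched with a DATA step like the one below. The data set names train and test are hypothetical (they are created in the WORK library here); "all" is the OUTALL output from the step above.

```sas
/* Sketch: route each row of the OUTALL output to one of two data sets. */
/* "train" and "test" are illustrative names, not from the original post. */
data train test;
   set all;
   if selected = 1 then output test;   /* sampled 30% */
   else output train;                  /* remaining 70% */
run;
```

Because the sample is random, the split changes on every run unless a seed is fixed; PROC SURVEYSELECT accepts a SEED= option for that purpose.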

New Contributor
Posts: 4

Hi! Thanks for your prompt reply!

But I still have some questions:

1. How do I create the flag variable "selected" and assign it the values 1 and 0?

2. Is 'outall' part of the syntax, or just a name that I choose?

If it is convenient, could you share the detailed steps?

Sorry, I am not used to data management.

Many thanks!

David

Solution
‎11-24-2014 11:09 AM
Trusted Advisor
Posts: 1,204

Hi,

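Just try the syntax given above. The flag variable "selected" is created automatically in the data set "all"; you do not need to assign its values yourself. OUTALL is an option of PROC SURVEYSELECT (part of the syntax), and "all" is the name of the resulting data set.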

New Contributor
Posts: 4

Thank you so much for your kindness!

David

New Contributor
Posts: 4

Hi!

I have successfully split the whole data set into two parts, a training set and a test set, and I used PROC FREQ to check that they were split in the proportion I need. It worked! Thanks.

Now I have used the training set (only the 'selected=0' observations) to build a linear model and estimate the betas. However, I do not know how to use this fitted model to predict the data in the test set.

In brief: how do I use a fitted model to predict (or validate) the data in the test set?

Warm regards,
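
For readers following along, a check of this kind might look like the sketch below. It assumes the OUTALL output data set is named "all", as in the earlier reply; the exact call used in the original post is not shown.

```sas
/* Sketch: tabulate the flag to confirm the split proportions. */
/* Roughly 30% of rows should have Selected = 1 and 70% Selected = 0. */
proc freq data=all;
   tables selected;
run;
```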

David
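
The thread ends without an answer to this last question. One possible approach, offered only as a sketch and not as the accepted solution, is to set the response to missing for the test rows and fit the model on the full data set. PROC REG leaves observations with a missing response out of the fit, but its OUTPUT statement still computes predicted values for them when the regressors are non-missing. The variable names y, x1, and x2 and the data set names model_input, scored, and test_scored are hypothetical; "all" is the OUTALL data set from above.

```sas
/* Sketch under assumptions: y is the response, x1 and x2 the predictors. */

/* 1. Keep the observed response, then hide it for the test rows so */
/*    that only the training rows (selected = 0) enter the fit.     */
data model_input;
   set all;
   y_actual = y;
   if selected = 1 then y = .;
run;

/* 2. Fit on the training rows; predictions are written for every row. */
proc reg data=model_input;
   model y = x1 x2;
   output out=scored p=y_pred;
run;
quit;

/* 3. Keep the test rows and compute the residuals by hand, since the */
/*    procedure's own residual is missing where y was set to missing. */
data test_scored;
   set scored;
   where selected = 1;
   residual = y_actual - y_pred;
run;

/* 4. Show the first few predicted values and residuals. */
proc print data=test_scored(obs=20);
   var y_actual y_pred residual;
run;
```

This keeps everything in one data set, so the same fitted coefficients produce both the training fit and the test predictions. Saving the estimates and scoring the test set in a separate step is another route, but it is not sketched here.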
