BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Daisy2
Obsidian | Level 7

I'm trying to split my data into Train (70%) and Test data sets.  My file has 508 rows, but the column I want to split only has about 350 cells filled.  The rest are empty.  I'm splitting the data through strata (3 cols) with simple random selection which seems to work, but it seems to be splitting the proportions from the 508 rows vs. the ~350 rows with actual data.  How do I get it to ignore the missing values in my data column "nce" for the split.  Here's my code. Thanks for the help.

/*partitioning data into 2 sets at 70% split with stratification*/
/* new column Selected = 0 is the 30% Test set, Selected = 1 is the 70% Train set */
proc surveyselect data=phrs2cut method=srs samprate= 0.7 out=StratSRS seed=12345 outall;
	samplingunit nce;
	strata loc2 ntrt2 pd2 notsorted / alloc=proportional stats;
	title "After Split";
run;
1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@Daisy2 wrote:
Thanks for the quick reply. After thinking about your question, I think it's most appropriate to do the former and not the latter - not use the records at all.
Where do I put your suggested code? Before Proc SurveySelect? (I'm pretty new to SAS) such as
Where not missing(nce);
proc surveyselect data=....??

or within the procedure?

In the procedure:

 

Proc surveyselect data=   <other proc options>;

   where not missing(nce);

  <remaing surveyselect statements>

 

Actual order is not critical but by placing it near the start of the procedure it is easier, IMHO, to remind me I was filtering data for some reason when I come back to the code later. In fact I would likely insert a comment such as

/* Do not want the missing values in this set to be considered as a Cluster for selection*/

 

 

View solution in original post

3 REPLIES 3
ballardw
Super User

@Daisy2 wrote:

I'm trying to split my data into Train (70%) and Test data sets.  My file has 508 rows, but the column I want to split only has about 350 cells filled.  The rest are empty.  I'm splitting the data through strata (3 cols) with simple random selection which seems to work, but it seems to be splitting the proportions from the 508 rows vs. the ~350 rows with actual data.  How do I get it to ignore the missing values in my data column "nce" for the split.  Here's my code. Thanks for the help.

/*partitioning data into 2 sets at 70% split with stratification*/
/* new column Selected = 0 is the 30% Test set, Selected = 1 is the 70% Train set */
proc surveyselect data=phrs2cut method=srs samprate= 0.7 out=StratSRS seed=12345 outall;
	samplingunit nce;
	strata loc2 ntrt2 pd2 notsorted / alloc=proportional stats;
	title "After Split";
run;

Please describe what you mean by "ignore the missing values"?

Do you mean not use the record at all, i.e. the source data to the ~350 observations with data for the variable? That could be done with a WHERE statement  such as: Where not missing(nce);

 

Not treat the missing value as level but include the records for selection?

From the documentation for Surveyselect:

PROC SURVEYSELECT treats missing values of STRATA and SAMPLINGUNIT variables like any other STRATA or SAMPLINGUNIT variable value. The missing values form a separate, valid variable level.

It looks like you may need to create an additional variable that has the properties for clustering that you want.

Daisy2
Obsidian | Level 7
Thanks for the quick reply. After thinking about your question, I think it's most appropriate to do the former and not the latter - not use the records at all.
Where do I put your suggested code? Before Proc SurveySelect? (I'm pretty new to SAS) such as
Where not missing(nce);
proc surveyselect data=....??

or within the procedure?
ballardw
Super User

@Daisy2 wrote:
Thanks for the quick reply. After thinking about your question, I think it's most appropriate to do the former and not the latter - not use the records at all.
Where do I put your suggested code? Before Proc SurveySelect? (I'm pretty new to SAS) such as
Where not missing(nce);
proc surveyselect data=....??

or within the procedure?

In the procedure:

 

Proc surveyselect data=   <other proc options>;

   where not missing(nce);

  <remaing surveyselect statements>

 

Actual order is not critical but by placing it near the start of the procedure it is easier, IMHO, to remind me I was filtering data for some reason when I come back to the code later. In fact I would likely insert a comment such as

/* Do not want the missing values in this set to be considered as a Cluster for selection*/

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 698 views
  • 3 likes
  • 2 in conversation