BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
delgaa07
Calcite | Level 5

Hi everyone,

 

I am in healthcare and am having a hard time figuring out how to split the data into a Training and Validation (70/30) sets where important covariates, such as Sex are balanced; I have 2 other categorical variables with contrast dummy codes that need to be balance as well after the split. Ultimately I will be putting this data through an elastic net regression analysis.

 

Do date, I've used PROC GLMSELECT but don't know if I can use PROC SURVEYSELECT or a DATA step. 

 

can anyone point me in the right direction?

1 ACCEPTED SOLUTION

Accepted Solutions
Watts
SAS Employee

This following PROC SURVEYSELECT code splits the data set into two groups (70% and 30%) and maintains the 70/30 distribution in the subgroups ('Sex' and 2 other categorical variables).

 

proc surveyselect data=have rate=0.70 outall out=result; 
strata Sex Var1 Var2;
run;

The OUTALL option includes all observations (both Training and Validation) in the output data set. The variable Selected is 1 for observations in the Training group (70%) and 0 for observations in the Validation group (30%). 

View solution in original post

9 REPLIES 9
PaigeMiller
Diamond | Level 26

Do you mean that you want 70% of the males to be in the training set and 30% in the validation set? What happens if 70% of the males is not an integer?

 

Do you mean that you want 70% of the females to be in the training set and 30% in the validation set? What happens if 70% of the females is not an integer?

 

 

--
Paige Miller
delgaa07
Calcite | Level 5

Yes exactly. I am wanting roughly the same amount of males/females in the training and validation sets. It' not completely necessary for these to be an integer, overall I want to see if they can be roughly equal. 

PaigeMiller
Diamond | Level 26

For one variable, like gender, you probably can so this SURVEYSELECT but as I am not that familiar with that procedure, it's pretty easy to do with a data step and maybe PROC RANK.

 

In a data step, assign random numbers to all of the subjects. Then use PROC RANK, with a BY GENDER; statement to rank the random numbers on a scale from 0 to 1 (the FRACTION option does this), and then any male or female whose rank fraction is less than 0.7 goes into the training group.

 

For multiple categorical variables, you just need to use BY GENDER VAR1 VAR2; and do the ranking with this 3-way group of variables. 

--
Paige Miller
delgaa07
Calcite | Level 5

I'll definitely give this a shot. 

 

Thank you for helping!!

mkeintz
PROC Star

Do you want them balanced 70/30 at the marginal frequency levels (i.e. at the totals by sex, totals by categorical var 1, totals by categorical var 2)?  Or do you want them balanced at every cross classification level?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
delgaa07
Calcite | Level 5

I would preferably like to have them balanced at every cross classification level.

 

Thanks, in advanced, for all your help!

mkeintz
PROC Star

I'm not familiar with GLMSELECT or SURVEYSELECT.  Here is a program that randomly assigns exactly 70% in training group for each cross-classification of SEX/CATVAR1/CATVAR2 whenever possible (i.e. whenever 70% of the cell count is an exact integer).

 

When 70% is not an exact integer the "extra" observation is randomly assigned to one or the other group.  The odds for this observation is a randomized function, based on how far above an integer the exact 70% is, as here:

 

Cell
Size
Min N
Training
Min N
(Validation)
Prob (Extra Obs=> Training)
10 7 3 No extra
11 7 3 0.70
12 8 3 0.40
13 9 3 0.10
14 9 4 0.80
15 10 4 0.50
16 11 4 0.20
17 11 5 0.90
18 12 5 0.60
19 13 5 0.30
20 14 6 No extra

 

 

The downside to the code below is that it requires a grasp of hash objects (for this case, think "lookup tables" stored in memory).  There will be one lookup table (named "h" in each case) for each sex/catvar1/catvar2 combination.  To always have to correct h in hand, there is a hash-of-hashes object ('hoh' below) that contains pointers to each h, based upon the values of sex/catvar1/catvar2:  

 

data have ;
  do sex='M','F';
    do catvar1=1,2,3;
	  do catvar2='A','B','C';
	    do until (mod(id,11)=0);
		  id+1;
		  output;
		end;
	  end;
	end;
  end;
run;

data dummy; 
  set have;
  length randnum 8;
  stop;
run;

data training (where=(flag='Training'))
     validation (where=(flag='Valid'));
  set have end=end_of_have;

  call streaminit(156667);
  randnum=rand('uniform');

  if _n_=1 then do;
    /* Hash of hashes object to point to the correct 
       object "h" for each sex/catvar1/catvar2 combo*/

    declare hash hoh (ordered:"A");
	  hoh.definekey('sex','catvar1','catvar2');
	  hoh.definedata('sex','catvar1','catvar2','h','hi');
	  hoh.definedone();
	declare hiter hohi ('hoh');
	declare hash h;    /*Don't instantiate, but reserve the name */
	declare hiter hi;  /*Don't instantiate, but reserve the name */
  end;

  _rc=hoh.find();      /* Load the corresponding hash H and hiter HI */
  if _rc^=0 then do;   /* If no such H/HI, then instantiate them */
    h=_new_ hash(ordered:'a',dataset:'dummy');
	  h.definekey('randnum');
	  h.definedata(all:'Y');
	  h.definedone();
	hi=_new_ hiter('h');
	hoh.add();
  end;

  h.add();  /* Add each obs to appropriate h */


  /* Step through all of the hashes identified in hoh */
  /* In each hash h, get its size (num_items) to calculate 70% */

  if end_of_have then do while (hohi.next()=0);
    expected_training=.7*h.num_items;
    do _i=1 by 1 while (hi.next()=0);
	  if _i - rand('uniform') <= expected_training then flag='Training';
	  else flag='Valid';
	  output;
	end;
  end;

  drop _: ;
run;

I created a test dataset HAVE with 11 cases in each call above (see the "mod(id,11)=0" function).

 

The empty dataset DUMMY is there just to put the variable RANDNUM into its header.  This makes for simpler syntax in instantiating hash objects h, because I don't have to individually list each variable in the h.definedata statement.

 

When reading have, the program sticks observations into the appropriate "h", but they are now randomly ordered.

 

At the end of reading dataset HAVE, the program steps through each object and outputs every retrieved "row".  The first batch if flagged for Training (with random assignment of the "extra" obs).  The rest are flagged for Validation.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

Hi @delgaa07,

 

I would probably use PROC SURVEYSELECT:

proc surveyselect data=have
method=srs rate=30
seed=2718 out=valset(drop=selectionprob samplingweight);
strata sex var2 var3;
run;

This would select the validation set (30% sample) as the union of 30% samples from each stratum (=subset defined by the combination of values of the three categorical variables in the STRATA statement). Make sure to include a unique ID for each observation (e.g., an observation number if there is no "natural" ID) so that you can obtain the training set "HAVE minus VALSET" by a simple MERGE step.

 

Even without the STRATA statement the proportions of the strata will be roughly balanced between the two subsets in most cases (if the strata are not too small), but the STRATA statement makes the balance "as good as possible." Apply PROC FREQ to HAVE and VALSET to see the impact of the STRATA statement:

proc freq data=have;
tables sex*var2*var3 / list;
run;
Watts
SAS Employee

This following PROC SURVEYSELECT code splits the data set into two groups (70% and 30%) and maintains the 70/30 distribution in the subgroups ('Sex' and 2 other categorical variables).

 

proc surveyselect data=have rate=0.70 outall out=result; 
strata Sex Var1 Var2;
run;

The OUTALL option includes all observations (both Training and Validation) in the output data set. The variable Selected is 1 for observations in the Training group (70%) and 0 for observations in the Validation group (30%). 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2384 views
  • 2 likes
  • 5 in conversation