Solved: Splitting Data with the same proportion of covariates

delgaa07 · Posted 12-09-2020 01:45 PM

Hi everyone,

I am in healthcare and am having a hard time figuring out how to split the data into a Training and Validation (70/30) sets where important covariates, such as Sex are balanced; I have 2 other categorical variables with contrast dummy codes that need to be balance as well after the split. Ultimately I will be putting this data through an elastic net regression analysis.

Do date, I've used PROC GLMSELECT but don't know if I can use PROC SURVEYSELECT or a DATA step.

can anyone point me in the right direction?

Watts · Posted 12-09-2020 06:37 PM

This following PROC SURVEYSELECT code splits the data set into two groups (70% and 30%) and maintains the 70/30 distribution in the subgroups ('Sex' and 2 other categorical variables).

proc surveyselect data=have rate=0.70 outall out=result; 
strata Sex Var1 Var2;
run;

The OUTALL option includes all observations (both Training and Validation) in the output data set. The variable Selected is 1 for observations in the Training group (70%) and 0 for observations in the Validation group (30%).

View solution in original post

PaigeMiller · Posted 12-09-2020 01:53 PM

Do you mean that you want 70% of the males to be in the training set and 30% in the validation set? What happens if 70% of the males is not an integer?

Do you mean that you want 70% of the females to be in the training set and 30% in the validation set? What happens if 70% of the females is not an integer?

--
Paige Miller

delgaa07 · Posted 12-09-2020 01:59 PM

Yes exactly. I am wanting roughly the same amount of males/females in the training and validation sets. It' not completely necessary for these to be an integer, overall I want to see if they can be roughly equal.

PaigeMiller · Posted 12-09-2020 03:05 PM

For one variable, like gender, you probably can so this SURVEYSELECT but as I am not that familiar with that procedure, it's pretty easy to do with a data step and maybe PROC RANK.

In a data step, assign random numbers to all of the subjects. Then use PROC RANK, with a BY GENDER; statement to rank the random numbers on a scale from 0 to 1 (the FRACTION option does this), and then any male or female whose rank fraction is less than 0.7 goes into the training group.

For multiple categorical variables, you just need to use BY GENDER VAR1 VAR2; and do the ranking with this 3-way group of variables.

--
Paige Miller

delgaa07 · Posted 12-09-2020 03:35 PM

I'll definitely give this a shot.

Thank you for helping!!

mkeintz · Posted 12-09-2020 03:28 PM

Do you want them balanced 70/30 at the marginal frequency levels (i.e. at the totals by sex, totals by categorical var 1, totals by categorical var 2)? Or do you want them balanced at every cross classification level?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

delgaa07 · Posted 12-09-2020 03:35 PM

I would preferably like to have them balanced at every cross classification level.

Thanks, in advanced, for all your help!

mkeintz · Posted 12-09-2020 05:42 PM

I'm not familiar with GLMSELECT or SURVEYSELECT. Here is a program that randomly assigns exactly 70% in training group for each cross-classification of SEX/CATVAR1/CATVAR2 whenever possible (i.e. whenever 70% of the cell count is an exact integer).

When 70% is not an exact integer the "extra" observation is randomly assigned to one or the other group. The odds for this observation is a randomized function, based on how far above an integer the exact 70% is, as here:

Cell Size	Min N Training	Min N (Validation)	Prob (Extra Obs=> Training)
10	7	3	No extra
11	7	3	0.70
12	8	3	0.40
13	9	3	0.10
14	9	4	0.80
15	10	4	0.50
16	11	4	0.20
17	11	5	0.90
18	12	5	0.60
19	13	5	0.30
20	14	6	No extra

The downside to the code below is that it requires a grasp of hash objects (for this case, think "lookup tables" stored in memory). There will be one lookup table (named "h" in each case) for each sex/catvar1/catvar2 combination. To always have to correct h in hand, there is a hash-of-hashes object ('hoh' below) that contains pointers to each h, based upon the values of sex/catvar1/catvar2:

data have ;
  do sex='M','F';
    do catvar1=1,2,3;
	  do catvar2='A','B','C';
	    do until (mod(id,11)=0);
		  id+1;
		  output;
		end;
	  end;
	end;
  end;
run;

data dummy; 
  set have;
  length randnum 8;
  stop;
run;

data training (where=(flag='Training'))
     validation (where=(flag='Valid'));
  set have end=end_of_have;

  call streaminit(156667);
  randnum=rand('uniform');

  if _n_=1 then do;
    /* Hash of hashes object to point to the correct 
       object "h" for each sex/catvar1/catvar2 combo*/

    declare hash hoh (ordered:"A");
	  hoh.definekey('sex','catvar1','catvar2');
	  hoh.definedata('sex','catvar1','catvar2','h','hi');
	  hoh.definedone();
	declare hiter hohi ('hoh');
	declare hash h;    /*Don't instantiate, but reserve the name */
	declare hiter hi;  /*Don't instantiate, but reserve the name */
  end;

  _rc=hoh.find();      /* Load the corresponding hash H and hiter HI */
  if _rc^=0 then do;   /* If no such H/HI, then instantiate them */
    h=_new_ hash(ordered:'a',dataset:'dummy');
	  h.definekey('randnum');
	  h.definedata(all:'Y');
	  h.definedone();
	hi=_new_ hiter('h');
	hoh.add();
  end;

  h.add();  /* Add each obs to appropriate h */


  /* Step through all of the hashes identified in hoh */
  /* In each hash h, get its size (num_items) to calculate 70% */

  if end_of_have then do while (hohi.next()=0);
    expected_training=.7*h.num_items;
    do _i=1 by 1 while (hi.next()=0);
	  if _i - rand('uniform') <= expected_training then flag='Training';
	  else flag='Valid';
	  output;
	end;
  end;

  drop _: ;
run;

I created a test dataset HAVE with 11 cases in each call above (see the "mod(id,11)=0" function).

The empty dataset DUMMY is there just to put the variable RANDNUM into its header. This makes for simpler syntax in instantiating hash objects h, because I don't have to individually list each variable in the h.definedata statement.

When reading have, the program sticks observations into the appropriate "h", but they are now randomly ordered.

At the end of reading dataset HAVE, the program steps through each object and outputs every retrieved "row". The first batch if flagged for Training (with random assignment of the "extra" obs). The rest are flagged for Validation.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 12-09-2020 06:32 PM

Hi @delgaa07,

I would probably use PROC SURVEYSELECT:

proc surveyselect data=have
method=srs rate=30
seed=2718 out=valset(drop=selectionprob samplingweight);
strata sex var2 var3;
run;

This would select the validation set (30% sample) as the union of 30% samples from each stratum (=subset defined by the combination of values of the three categorical variables in the STRATA statement). Make sure to include a unique ID for each observation (e.g., an observation number if there is no "natural" ID) so that you can obtain the training set "HAVE minus VALSET" by a simple MERGE step.

Even without the STRATA statement the proportions of the strata will be roughly balanced between the two subsets in most cases (if the strata are not too small), but the STRATA statement makes the balance "as good as possible." Apply PROC FREQ to HAVE and VALSET to see the impact of the STRATA statement:

proc freq data=have;
tables sex*var2*var3 / list;
run;

Watts · Posted 12-09-2020 06:37 PM

This following PROC SURVEYSELECT code splits the data set into two groups (70% and 30%) and maintains the 70/30 distribution in the subgroups ('Sex' and 2 other categorical variables).

proc surveyselect data=have rate=0.70 outall out=result; 
strata Sex Var1 Var2;
run;

The OUTALL option includes all observations (both Training and Validation) in the output data set. The variable Selected is 1 for observations in the Training group (70%) and 0 for observations in the Validation group (30%).

Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Re: Splitting Data with the same proportion of covariates

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!