<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Splitting Data with the same proportion of covariates in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704883#M216123</link>
    <description>&lt;P&gt;I'm not familiar with GLMSELECT or SURVEYSELECT.&amp;nbsp; Here is a program that randomly assigns exactly 70% in training group for each cross-classification of SEX/CATVAR1/CATVAR2 whenever possible (i.e. whenever 70% of the cell count is an exact integer).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When 70% is not an exact integer, the "extra" observation is randomly assigned to one group or the other.&amp;nbsp; The probability that it goes to Training equals the fractional part of the exact 70% (i.e. how far above an integer the exact 70% is), as here:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE width="305"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="54"&gt;Cell&lt;BR /&gt;Size&lt;/TD&gt;
&lt;TD width="73"&gt;Min N&lt;BR /&gt;(Training)&lt;/TD&gt;
&lt;TD width="93"&gt;Min N&lt;BR /&gt;(Validation)&lt;/TD&gt;
&lt;TD width="85"&gt;Prob (Extra Obs=&amp;gt; Training)&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;10&lt;/TD&gt;
&lt;TD&gt;7&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;No extra&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;11&lt;/TD&gt;
&lt;TD&gt;7&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;0.70&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;12&lt;/TD&gt;
&lt;TD&gt;8&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;0.40&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;13&lt;/TD&gt;
&lt;TD&gt;9&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;0.10&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;14&lt;/TD&gt;
&lt;TD&gt;9&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;TD&gt;0.80&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;15&lt;/TD&gt;
&lt;TD&gt;10&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;TD&gt;0.50&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;16&lt;/TD&gt;
&lt;TD&gt;11&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;TD&gt;0.20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;17&lt;/TD&gt;
&lt;TD&gt;11&lt;/TD&gt;
&lt;TD&gt;5&lt;/TD&gt;
&lt;TD&gt;0.90&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;18&lt;/TD&gt;
&lt;TD&gt;12&lt;/TD&gt;
&lt;TD&gt;5&lt;/TD&gt;
&lt;TD&gt;0.60&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;19&lt;/TD&gt;
&lt;TD&gt;13&lt;/TD&gt;
&lt;TD&gt;5&lt;/TD&gt;
&lt;TD&gt;0.30&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;20&lt;/TD&gt;
&lt;TD&gt;14&lt;/TD&gt;
&lt;TD&gt;6&lt;/TD&gt;
&lt;TD&gt;No extra&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The downside to the code below is that it requires a grasp of hash objects (for this case, think "lookup tables" stored in memory).&amp;nbsp; There will be one lookup table (named "h" in each case) for each sex/catvar1/catvar2 combination.&amp;nbsp; To always have the correct h in hand, there is a hash-of-hashes object ('hoh' below) that contains pointers to each h, keyed on the values of sex/catvar1/catvar2:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have ;
  do sex='M','F';
    do catvar1=1,2,3;
      do catvar2='A','B','C';
        do until (mod(id,11)=0);
          id+1;
          output;
        end;
      end;
    end;
  end;
run;

data dummy; 
  set have;
  length randnum 8;
  stop;
run;

data training (where=(flag='Training'))
     validation (where=(flag='Valid'));
  set have end=end_of_have;

  call streaminit(156667);
  randnum=rand('uniform');

  if _n_=1 then do;
    /* Hash of hashes object to point to the correct 
       object "h" for each sex/catvar1/catvar2 combo*/

    declare hash hoh (ordered:"A");
      hoh.definekey('sex','catvar1','catvar2');
      hoh.definedata('sex','catvar1','catvar2','h','hi');
      hoh.definedone();
    declare hiter hohi ('hoh');
    declare hash h;    /* Don't instantiate, but reserve the name */
    declare hiter hi;  /* Don't instantiate, but reserve the name */
  end;

  _rc=hoh.find();      /* Load the corresponding hash H and hiter HI */
  if _rc^=0 then do;   /* If no such H/HI, then instantiate them */
    h=_new_ hash(ordered:'a',dataset:'dummy');
      h.definekey('randnum');
      h.definedata(all:'Y');
      h.definedone();
    hi=_new_ hiter('h');
    hoh.add();
  end;

  h.add();  /* Add each obs to appropriate h */


  /* Step through all of the hashes identified in hoh */
  /* In each hash h, get its size (num_items) to calculate 70% */

  if end_of_have then do while (hohi.next()=0);
    expected_training=.7*h.num_items;
    do _i=1 by 1 while (hi.next()=0);
      if _i - rand('uniform') &amp;lt;= expected_training then flag='Training';
      else flag='Valid';
      output;
    end;
  end;

  drop _: ;
run;

&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I created a test dataset HAVE with 11 cases in each cell above (see the "mod(id,11)=0" condition).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The empty dataset DUMMY is there just to put the variable RANDNUM into its header.&amp;nbsp; This makes for simpler syntax in instantiating hash objects h, because I don't have to individually list each variable in the h.definedata statement.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When reading HAVE, the program puts each observation into the appropriate "h".&amp;nbsp; Because each h is keyed and ordered on RANDNUM, the observations within it are now in random order.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;At the end of reading dataset HAVE, the program steps through each object and outputs every retrieved "row".&amp;nbsp; The first batch is flagged for Training (with random assignment of the "extra" obs).&amp;nbsp; The rest are flagged for Validation.&lt;/P&gt;</description>
    <pubDate>Wed, 09 Dec 2020 22:42:46 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2020-12-09T22:42:46Z</dc:date>
    <item>
      <title>Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704832#M216101</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am in healthcare and am having a hard time figuring out how to split the data into Training and Validation sets (70/30) where important covariates, such as Sex, are balanced; I have 2 other categorical variables with contrast dummy codes that need to be balanced as well after the split. Ultimately I will be putting this data through an elastic net regression analysis.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To date, I've used PROC GLMSELECT but don't know if I can use PROC SURVEYSELECT or a DATA step.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone point me in the right direction?&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 18:45:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704832#M216101</guid>
      <dc:creator>delgaa07</dc:creator>
      <dc:date>2020-12-09T18:45:07Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704835#M216103</link>
      <description>&lt;P&gt;Do you mean that you want 70% of the males to be in the training set and 30% in the validation set?&amp;nbsp;What happens if 70% of the males is not an integer?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you mean that you want 70% of the females to be in the training set and 30% in the validation set?&amp;nbsp;What happens if 70% of the females is not an integer?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 18:54:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704835#M216103</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2020-12-09T18:54:04Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704836#M216104</link>
      <description>&lt;P&gt;Yes, exactly. I want roughly the same proportion of males/females in the training and validation sets. It's not completely necessary for these to be integers; overall I want to see if they can be roughly equal.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 18:59:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704836#M216104</guid>
      <dc:creator>delgaa07</dc:creator>
      <dc:date>2020-12-09T18:59:17Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704854#M216109</link>
      <description>&lt;P&gt;For one variable, like gender, you can probably do this with SURVEYSELECT, but I am not that familiar with that procedure. It's pretty easy to do with a data step and maybe PROC RANK.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In a data step, assign random numbers to all of the subjects. Then use PROC RANK, with a BY GENDER; statement to rank the random numbers on a scale from 0 to 1 (the FRACTION option does this), and then any male or female whose rank fraction is less than 0.7 goes into the training group.&lt;/P&gt;
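&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A minimal sketch of those steps for a single BY variable. The dataset name HAVE, the variable SEX, the seed, and the output names are all placeholders I made up, not something taken from your data:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Step 1: attach a uniform random number to every subject */
data rand;
  set have;
  call streaminit(2020);   /* arbitrary seed, only so the split is repeatable */
  u = rand('uniform');
run;

/* PROC RANK with a BY statement needs the data sorted by the BY variable */
proc sort data=rand;
  by sex;
run;

/* Step 2: FRACTION rescales the ranks within each BY group to the range (0,1] */
proc rank data=rand out=ranked fraction;
  by sex;
  var u;
  ranks u_frac;
run;

/* Step 3: rank fraction below 0.7 goes to training, the rest to validation */
data training validation;
  set ranked;
  if u_frac &amp;lt; 0.7 then output training;
  else output validation;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Whether the cutoff uses &amp;lt; or &amp;lt;= shifts at most one observation per group, which should not matter for a rough 70/30 split.&lt;/P&gt;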
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For multiple categorical variables, you just need to use BY GENDER VAR1 VAR2; and do the ranking with this 3-way group of variables.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 20:05:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704854#M216109</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2020-12-09T20:05:52Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704862#M216110</link>
      <description>&lt;P&gt;Do you want them balanced 70/30 at the marginal frequency levels (i.e. at the totals by sex, totals by categorical var 1, totals by categorical var 2)?&amp;nbsp; Or do you want them balanced at every cross classification level?&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 20:28:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704862#M216110</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2020-12-09T20:28:47Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704863#M216111</link>
      <description>&lt;P&gt;I would preferably like to have them&amp;nbsp;balanced at every cross classification level.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance for all your help!&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 20:35:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704863#M216111</guid>
      <dc:creator>delgaa07</dc:creator>
      <dc:date>2020-12-09T20:35:09Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704864#M216112</link>
      <description>&lt;P&gt;I'll definitely give this a shot.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for helping!!&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 20:35:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704864#M216112</guid>
      <dc:creator>delgaa07</dc:creator>
      <dc:date>2020-12-09T20:35:48Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704883#M216123</link>
      <description>&lt;P&gt;I'm not familiar with GLMSELECT or SURVEYSELECT.&amp;nbsp; Here is a program that randomly assigns exactly 70% in training group for each cross-classification of SEX/CATVAR1/CATVAR2 whenever possible (i.e. whenever 70% of the cell count is an exact integer).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When 70% is not an exact integer, the "extra" observation is randomly assigned to one group or the other.&amp;nbsp; The probability that it goes to Training equals the fractional part of the exact 70% (i.e. how far above an integer the exact 70% is), as here:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE width="305"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="54"&gt;Cell&lt;BR /&gt;Size&lt;/TD&gt;
&lt;TD width="73"&gt;Min N&lt;BR /&gt;(Training)&lt;/TD&gt;
&lt;TD width="93"&gt;Min N&lt;BR /&gt;(Validation)&lt;/TD&gt;
&lt;TD width="85"&gt;Prob (Extra Obs=&amp;gt; Training)&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;10&lt;/TD&gt;
&lt;TD&gt;7&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;No extra&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;11&lt;/TD&gt;
&lt;TD&gt;7&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;0.70&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;12&lt;/TD&gt;
&lt;TD&gt;8&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;0.40&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;13&lt;/TD&gt;
&lt;TD&gt;9&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;TD&gt;0.10&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;14&lt;/TD&gt;
&lt;TD&gt;9&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;TD&gt;0.80&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;15&lt;/TD&gt;
&lt;TD&gt;10&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;TD&gt;0.50&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;16&lt;/TD&gt;
&lt;TD&gt;11&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;TD&gt;0.20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;17&lt;/TD&gt;
&lt;TD&gt;11&lt;/TD&gt;
&lt;TD&gt;5&lt;/TD&gt;
&lt;TD&gt;0.90&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;18&lt;/TD&gt;
&lt;TD&gt;12&lt;/TD&gt;
&lt;TD&gt;5&lt;/TD&gt;
&lt;TD&gt;0.60&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;19&lt;/TD&gt;
&lt;TD&gt;13&lt;/TD&gt;
&lt;TD&gt;5&lt;/TD&gt;
&lt;TD&gt;0.30&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;20&lt;/TD&gt;
&lt;TD&gt;14&lt;/TD&gt;
&lt;TD&gt;6&lt;/TD&gt;
&lt;TD&gt;No extra&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
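&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The columns in that table are just arithmetic on 0.7 times the cell size. Here is a small sketch that reproduces them. The dataset and variable names are made up for illustration, and I work in tenths (7*cell_size) only to sidestep floating-point fuzz:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data split_table;
  do cell_size = 10 to 20;
    tenths    = 7 * cell_size;                    /* 0.7*cell_size, expressed in tenths */
    p_extra   = mod(tenths, 10) / 10;             /* Prob(extra obs =&amp;gt; Training); 0 means no extra obs */
    min_train = (tenths - mod(tenths, 10)) / 10;  /* integer part of 0.7*cell_size */
    min_valid = cell_size - min_train - (p_extra &amp;gt; 0);  /* set the extra obs aside, if any */
    output;
  end;
  drop tenths;
run;

proc print data=split_table noobs;
run;
&lt;/CODE&gt;&lt;/PRE&gt;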
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The downside to the code below is that it requires a grasp of hash objects (for this case, think "lookup tables" stored in memory).&amp;nbsp; There will be one lookup table (named "h" in each case) for each sex/catvar1/catvar2 combination.&amp;nbsp; To always have the correct h in hand, there is a hash-of-hashes object ('hoh' below) that contains pointers to each h, keyed on the values of sex/catvar1/catvar2:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have ;
  do sex='M','F';
    do catvar1=1,2,3;
      do catvar2='A','B','C';
        do until (mod(id,11)=0);
          id+1;
          output;
        end;
      end;
    end;
  end;
run;

data dummy; 
  set have;
  length randnum 8;
  stop;
run;

data training (where=(flag='Training'))
     validation (where=(flag='Valid'));
  set have end=end_of_have;

  call streaminit(156667);
  randnum=rand('uniform');

  if _n_=1 then do;
    /* Hash of hashes object to point to the correct 
       object "h" for each sex/catvar1/catvar2 combo*/

    declare hash hoh (ordered:"A");
      hoh.definekey('sex','catvar1','catvar2');
      hoh.definedata('sex','catvar1','catvar2','h','hi');
      hoh.definedone();
    declare hiter hohi ('hoh');
    declare hash h;    /* Don't instantiate, but reserve the name */
    declare hiter hi;  /* Don't instantiate, but reserve the name */
  end;

  _rc=hoh.find();      /* Load the corresponding hash H and hiter HI */
  if _rc^=0 then do;   /* If no such H/HI, then instantiate them */
    h=_new_ hash(ordered:'a',dataset:'dummy');
      h.definekey('randnum');
      h.definedata(all:'Y');
      h.definedone();
    hi=_new_ hiter('h');
    hoh.add();
  end;

  h.add();  /* Add each obs to appropriate h */


  /* Step through all of the hashes identified in hoh */
  /* In each hash h, get its size (num_items) to calculate 70% */

  if end_of_have then do while (hohi.next()=0);
    expected_training=.7*h.num_items;
    do _i=1 by 1 while (hi.next()=0);
      if _i - rand('uniform') &amp;lt;= expected_training then flag='Training';
      else flag='Valid';
      output;
    end;
  end;

  drop _: ;
run;

&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I created a test dataset HAVE with 11 cases in each cell above (see the "mod(id,11)=0" condition).&lt;/P&gt;
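&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With 11 cases per cell, the table above implies 7 or 8 Training observations in every cell (7 guaranteed, plus the extra one with probability 0.70). A quick way to check that, assuming the TRAINING dataset from the step above has been created:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* One row per sex/catvar1/catvar2 cell; every frequency should be 7 or 8 */
proc freq data=training;
  tables sex*catvar1*catvar2 / list nocum nopercent;
run;
&lt;/CODE&gt;&lt;/PRE&gt;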
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The empty dataset DUMMY is there just to put the variable RANDNUM into its header.&amp;nbsp; This makes for simpler syntax in instantiating hash objects h, because I don't have to individually list each variable in the h.definedata statement.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When reading HAVE, the program puts each observation into the appropriate "h".&amp;nbsp; Because each h is keyed and ordered on RANDNUM, the observations within it are now in random order.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;At the end of reading dataset HAVE, the program steps through each object and outputs every retrieved "row".&amp;nbsp; The first batch is flagged for Training (with random assignment of the "extra" obs).&amp;nbsp; The rest are flagged for Validation.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 22:42:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704883#M216123</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2020-12-09T22:42:46Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704890#M216125</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/360254"&gt;@delgaa07&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would probably use PROC SURVEYSELECT:&lt;/P&gt;
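&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* The STRATA statement expects the input sorted by the strata variables */
proc sort data=have;
by sex var2 var3;
run;
proc surveyselect data=have
method=srs rate=30
seed=2718 out=valset(drop=selectionprob samplingweight);
strata sex var2 var3;
run;&lt;/CODE&gt;&lt;/PRE&gt;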
&lt;P&gt;This would select the validation set (30% sample) as the union of 30% samples from each stratum (=subset defined by the combination of values of the three categorical variables in the STRATA statement). Make sure to include a unique ID for each observation (e.g., an observation number if there is no "natural" ID) so that you can obtain the training set "HAVE minus VALSET" by a simple MERGE step.&lt;/P&gt;
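&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A sketch of that MERGE step, assuming the unique key is a variable named ID (a placeholder; use whatever ID you have). TRAINSET is just a name I picked:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* MERGE with BY needs both datasets sorted by the key */
proc sort data=have;   by id; run;
proc sort data=valset; by id; run;

/* Training set = observations of HAVE whose ID does not occur in VALSET */
data trainset;
  merge have valset(in=in_val keep=id);
  by id;
  if not in_val;
run;
&lt;/CODE&gt;&lt;/PRE&gt;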
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Even without the STRATA statement the proportions of the strata will be &lt;EM&gt;roughly&lt;/EM&gt; balanced between the two subsets in most cases (if the strata are not too small), but the STRATA statement makes the balance "&lt;EM&gt;as good as possible&lt;/EM&gt;." Apply PROC FREQ to HAVE and VALSET to see the impact of the STRATA statement:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc freq data=have;
tables sex*var2*var3 / list;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 09 Dec 2020 23:32:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704890#M216125</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2020-12-09T23:32:31Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting Data with the same proportion of covariates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704892#M216126</link>
      <description>&lt;P&gt;The following PROC SURVEYSELECT code splits the data set into two groups (70% and 30%) and maintains the 70/30 distribution in the subgroups ('Sex' and 2 other categorical variables).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
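&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;  /* STRATA expects input sorted by the strata variables */
by Sex Var1 Var2;
run;
proc surveyselect data=have rate=0.70 outall out=result; 
strata Sex Var1 Var2;
run;&lt;/CODE&gt;&lt;/PRE&gt;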
&lt;P&gt;The&amp;nbsp;&lt;A href="https://documentation.sas.com/?docsetId=statug&amp;amp;docsetTarget=statug_surveyselect_syntax01.htm&amp;amp;docsetVersion=15.2&amp;amp;locale=en#statug.surveyselect.selectoutall" target="_self"&gt;OUTALL&lt;/A&gt;&amp;nbsp;option includes all observations (both Training and Validation) in the output data set. The variable&amp;nbsp;&lt;EM&gt;Selected&lt;/EM&gt;&amp;nbsp;is 1 for observations in the Training group (70%) and 0 for observations in the Validation group (30%).&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2020 23:37:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-Data-with-the-same-proportion-of-covariates/m-p/704892#M216126</guid>
      <dc:creator>Watts</dc:creator>
      <dc:date>2020-12-09T23:37:55Z</dc:date>
    </item>
  </channel>
</rss>

