<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: even sampling but how ? in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402948#M278833</link>
    <description>&lt;P&gt;What do you mean by: "&amp;nbsp;I want to streamline the process of sampling evenly."&lt;/P&gt;
&lt;P&gt;Do you mean select the same number of records for each teacher or select the same proportion&amp;nbsp;such as&amp;nbsp;10% per teacher?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you select a number of records what do you want done with the teachers that have fewer than that number?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In either case I think that proc surveyselect may work. Sort your data by teacher and use teacher as a stratum variable.&lt;/P&gt;
&lt;P&gt;Here are examples using the SASHELP.CLASS data set. The first Surveyselect block&amp;nbsp;selects 5 records from each sex, the second selects 25% from each sex.&lt;/P&gt;
&lt;PRE&gt;proc sort data=sashelp.class out=work.classsort;
   by sex;
run;

proc surveyselect data=work.classsort
     out=work.numberperstrat
     sampsize=5
     ;
   strata sex;
run;


proc surveyselect data=work.classsort
     out=work.rateperstrat
     samprate=.25
     ;
   strata sex;
run;&lt;/PRE&gt;
&lt;P&gt;There are additional options you can use to control the minimum or maximum sample size per stratum. The output data set also contains the selection probability and the sampling weight for each selected record and, by default, all of the other variables from the input data set.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 10 Oct 2017 20:31:36 GMT</pubDate>
    <dc:creator>ballardw</dc:creator>
    <dc:date>2017-10-10T20:31:36Z</dc:date>
    <item>
      <title>even sampling but how ?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402297#M278830</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have the following scenario and wanted to ask what the best way of sampling is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a huge dataset of teachers and their reporting (skills).&lt;/P&gt;&lt;P&gt;Every teacher has to write reports about the progress of his / her students. But some teachers will create far more reports than others, simply because they have more pupils or are more efficient at writing reports.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Teachers are to be assessed on their reporting skills, taking into account the way they assess their pupils. However, I am not worried about this yet.&lt;/P&gt;&lt;P&gt;I simply want to find the best possible method of sampling teachers. Say we have a dataset of 20 million teacher reports. Some teachers have written vast numbers of reports during their lifetime, others only a few. So if I were to sample them randomly, those who have written more reports during their lifetime would be oversampled.&lt;/P&gt;&lt;P&gt;I guess I could deduplicate the teachers and then take a random sample.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But I would rather not do that due to the sheer size of the dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there a better way? I want to streamline the process of sampling evenly. How can I achieve this?&lt;/P&gt;&lt;P&gt;Many thanks for any thoughts and help.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Oct 2017 08:44:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402297#M278830</guid>
      <dc:creator>Tinker_</dc:creator>
      <dc:date>2017-10-09T08:44:01Z</dc:date>
    </item>
    <item>
      <title>Re: even sampling but how ?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402308#M278831</link>
      <description>&lt;P&gt;First of all, your initial dataset should be sorted in a sensible way (teacher-ID, then date,...), so you can easily merge from it.&lt;/P&gt;
&lt;P&gt;At least have an index on teacher-ID defined.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then do&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort
  data=reports (keep=teacher_ID)
  out=teachers
  nodupkey
;
by teacher_ID;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;and you will have a deduplicated dataset, from which it is easy to create samples.&lt;/P&gt;
&lt;P&gt;If you have a UUID-based key, which takes just 16 bytes to store in binary form, you'll end up with ~16 MB per 1 million teachers.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
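&lt;P&gt;As a rough sketch of the sample-then-merge step, assuming the TEACHERS data set from the sort above and a REPORTS data set that is sorted (or indexed) by teacher_ID: the sample size of 1000, the seed and the data set names TEACHER_SAMPLE and SAMPLED_REPORTS are placeholders chosen for illustration, not values from the question.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Draw a simple random sample of teachers (one row per teacher) */
proc surveyselect data=teachers
     out=teacher_sample
     method=srs
     sampsize=1000   /* placeholder sample size */
     seed=12345      /* arbitrary seed, only for repeatable results */
     ;
run;

/* Keep every report written by a sampled teacher.       */
/* Both inputs must be in teacher_ID order for the merge. */
data sampled_reports;
  merge teacher_sample (in=picked keep=teacher_ID)
        reports;
  by teacher_ID;
  if picked;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Every teacher has the same chance of selection here, no matter how many reports he or she has written.&lt;/P&gt;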
&lt;P&gt;Since all files are sorted by the same key, a merge to get the individual reports is just a sequential scan through the datasets.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Oct 2017 10:34:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402308#M278831</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2017-10-09T10:34:20Z</dc:date>
    </item>
    <item>
      <title>Re: even sampling but how ?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402490#M278832</link>
      <description>&lt;P&gt;Let's say dataset HAVE is sorted by teacherid, where there can be any number of reports by a given teacher, and you want a sample of, say, 3 from each teacher. This is straightforward using two DO loops with a SET inside each. The first loop counts the teacher's reports; the second rereads the same reports and keeps each one with probability nwant/nrpt, i.e. reports still wanted divided by reports still remaining. A teacher with 3 or fewer reports contributes all of them.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
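&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  /* Pass 1: count the reports of the current teacher (nrpt) */
  do nrpt=1 by 1 until (last.teacherid);
    set have;
    by teacherid;
  end;

  nwant=3;  /* reports still to be selected for this teacher */

  /* Pass 2: reread the same reports; nrpt counts down the reports remaining */
  do nrpt=nrpt to 1 by -1;
    set have;
    if rand('uniform') &amp;lt;= nwant/nrpt then do;
      nwant=nwant-1;
      output;
    end;
  end;
run;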
&lt;/CODE&gt;&lt;/PRE&gt;
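&lt;P&gt;One possible way to check the result, assuming the WANT data set produced above: the query below lists every teacher who ended up with a number of sampled reports other than 3. The column name n_sampled is just an illustrative choice.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
  /* Teachers whose sample does not contain exactly 3 reports */
  select teacherid, count(*) as n_sampled
  from want
  group by teacherid
  having count(*) ne 3;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Any teachers listed should be those with fewer than 3 reports in HAVE, since all of their reports are kept.&lt;/P&gt;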
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 09 Oct 2017 19:45:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402490#M278832</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-10-09T19:45:43Z</dc:date>
    </item>
    <item>
      <title>Re: even sampling but how ?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402948#M278833</link>
      <description>&lt;P&gt;What do you mean by: "&amp;nbsp;I want to streamline the process of sampling evenly."&lt;/P&gt;
&lt;P&gt;Do you mean select the same number of records for each teacher or select the same proportion&amp;nbsp;such as&amp;nbsp;10% per teacher?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you select a number of records what do you want done with the teachers that have fewer than that number?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In either case I think that proc surveyselect may work. Sort your data by teacher and use teacher as a stratum variable.&lt;/P&gt;
&lt;P&gt;Here are examples using the SASHELP.CLASS data set. The first Surveyselect block&amp;nbsp;selects 5 records from each sex, the second selects 25% from each sex.&lt;/P&gt;
&lt;PRE&gt;proc sort data=sashelp.class out=work.classsort;
   by sex;
run;

proc surveyselect data=work.classsort
     out=work.numberperstrat
     sampsize=5
     ;
   strata sex;
run;


proc surveyselect data=work.classsort
     out=work.rateperstrat
     samprate=.25
     ;
   strata sex;
run;&lt;/PRE&gt;
&lt;P&gt;There are additional options you can use to control the minimum or maximum sample size per stratum. The output data set also contains the selection probability and the sampling weight for each selected record and, by default, all of the other variables from the input data set.&lt;/P&gt;
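&lt;P&gt;For the question of teachers with fewer records than the requested number, one option that may help is SELECTALL, which takes every record in a stratum that is smaller than the requested sample size instead of stopping with an error. A minimal sketch, assuming a data set named HAVE that is sorted by a variable named teacherid (both names, the sample size and the seed are placeholders):&lt;/P&gt;
&lt;PRE&gt;proc surveyselect data=have
     out=work.threeperteacher
     method=srs
     sampsize=3
     selectall      /* take all records when a teacher has fewer than 3 */
     seed=12345     /* arbitrary seed, only for repeatable results */
     ;
   strata teacherid;
run;&lt;/PRE&gt;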
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Oct 2017 20:31:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/even-sampling-but-how/m-p/402948#M278833</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2017-10-10T20:31:36Z</dc:date>
    </item>
  </channel>
</rss>

