<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Finding the best subset of a given size. in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47503#M2074</link>
    <description>Well, since combinatorial search is out, a Monte Carlo approach could be adopted once a "how good is it" function is specified (e.g. percentile matching). &lt;BR /&gt;
&lt;BR /&gt;
As an example start with a random subset of the specified size, pick a member and replace with a randomly selected value from the population if the figure of merit improves, allowing a pre-specified number of trials. If a member can be singly evaluated pick the worst one. Stop when &lt;I&gt;k&lt;/I&gt; members have failed. Restart with a new subset, etc. Keep the best one.</description>
    <pubDate>Tue, 23 Jun 2009 18:49:37 GMT</pubDate>
    <dc:creator>deleted_user</dc:creator>
    <dc:date>2009-06-23T18:49:37Z</dc:date>
    <item>
      <title>Finding the best subset of a given size.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47499#M2070</link>
      <description>Is there any method/procedure in SAS to use, when you want to find the subset(s) that best describes the original set?&lt;BR /&gt;
&lt;BR /&gt;
I have a table with 7 numerical variables. The number of records are 158 792.&lt;BR /&gt;
I want to find the subset(s) of size 5 000 records that can be regarded as the best sample(s) of that particular size.&lt;BR /&gt;
&lt;BR /&gt;
How can SAS be used to solve this?&lt;BR /&gt;
&lt;BR /&gt;
Binomial(158 792, 5 000 ) is a very big number.&lt;BR /&gt;
&lt;BR /&gt;
Anne</description>
      <pubDate>Mon, 22 Jun 2009 06:58:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47499#M2070</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2009-06-22T06:58:14Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the best subset of a given size.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47500#M2071</link>
      <description>Anne,&lt;BR /&gt;
&lt;BR /&gt;
What criteria are you going to use to decide a ranking of the subsets?  Is it critical to determine the absolute best on that scale, or would any of, say, 100 subsets that were approximately equal be sufficient?&lt;BR /&gt;
&lt;BR /&gt;
And you are absolutely right that Binomial(158 792, 5 000 ) is a very big number--you are in the cosmological realm with that.  And that is what makes me wonder why you would need the "best" subset, rather than a properly stratified random sample.&lt;BR /&gt;
&lt;BR /&gt;
Good luck.&lt;BR /&gt;
&lt;BR /&gt;
Steve Denham</description>
      <pubDate>Mon, 22 Jun 2009 12:29:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47500#M2071</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2009-06-22T12:29:58Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the best subset of a given size.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47501#M2072</link>
      <description>Steve&lt;BR /&gt;
&lt;BR /&gt;
If I put it this way:&lt;BR /&gt;
&lt;BR /&gt;
The seven variables can have the values of any of the integers 1-5. Then there are 5^7=78 125 possible &lt;B&gt;different&lt;/B&gt; combinations.&lt;BR /&gt;
&lt;BR /&gt;
I generate a SAS table with all the possible combinations. That's easy to do with SAS code. &lt;BR /&gt;
&lt;BR /&gt;
Then I would like to know if in a subset, for instance with 900 of the 78 125 records, there would be at least one record where the values would be equal for at least 5 of the 7 variables, if I checked it against &lt;B&gt;any of the 78 125 records&lt;/B&gt;.&lt;BR /&gt;
&lt;BR /&gt;
If it was possible to even find the &lt;B&gt;best&lt;/B&gt;(=with the biggest number of records with 5 values equal, or 6 values, if that was possible) &lt;B&gt;combination&lt;/B&gt; of a 900 record subset, that would even be better.&lt;BR /&gt;
&lt;BR /&gt;
Are any of these things possible to find out with SAS ( in a lifetime )?&lt;BR /&gt;
&lt;BR /&gt;
Anne

Message was edited by: AnneP</description>
      <pubDate>Mon, 22 Jun 2009 17:58:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47501#M2072</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2009-06-22T17:58:40Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the best subset of a given size.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47502#M2073</link>
      <description>Even the smaller problem is not tractable, if we are to examine all possible combinations.  Let's assume that you have access to a petaflop speed machine, and it takes 10 floating point operations to characterize a single combination.  That means we could generate 10^14 combinations in one second. Using Ramanujan's approximation for log n!, I get something like 10^2106 years to generate all possible 900 record subsets.  So let's not look at all possibles, it is just too daunting.  &lt;BR /&gt;
&lt;BR /&gt;
Instead, let's see how many subsets we would have to look at to have a 50% chance of finding one that has at least 5 equal values.  First, using a hypergeometric distribution, the probability that any single subset of 900 has at least 5 equal values is 4.13*10^-9.  Now we consider a geometric distribution, apply some approximations, and see that we need to generate about 121 million successive subsets to have a roughly 50% chance of finding one that has at least five equal values.  That might be doable in SAS in a reasonable amount of time&lt;BR /&gt;
&lt;BR /&gt;
The original problem is still intractable.  The probability of at least 5 equal values in this situation is on the order of 10^-40, so we are looking at about 2.6*10^40 subsets to have a roughly 50% chance of finding one.  On the petaflop machine, this is still 10^25 seconds or 3.2*10^17 years.</description>
      <pubDate>Tue, 23 Jun 2009 13:56:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47502#M2073</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2009-06-23T13:56:10Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the best subset of a given size.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47503#M2074</link>
      <description>Well, since combinatorial search is out, a Monte Carlo approach could be adopted once a "how good is it" function is specified (e.g. percentile matching). &lt;BR /&gt;
&lt;BR /&gt;
As an example start with a random subset of the specified size, pick a member and replace with a randomly selected value from the population if the figure of merit improves, allowing a pre-specified number of trials. If a member can be singly evaluated pick the worst one. Stop when &lt;I&gt;k&lt;/I&gt; members have failed. Restart with a new subset, etc. Keep the best one.</description>
      <pubDate>Tue, 23 Jun 2009 18:49:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47503#M2074</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2009-06-23T18:49:37Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the best subset of a given size.</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47504#M2075</link>
      <description>This would be my preference as well.  The algorithm could also prune branches that are obvious dead ends as well, as in the exact algorithms in FREQ, MULTTEST and LOGISTIC.&lt;BR /&gt;
&lt;BR /&gt;
But this bangs up against theoretical computer science, in an interesting way.  We have traded a very tedious polynomial time process for a non-polynomial time one (a classical stopping problem).  That doesn't mean it will take longer--it's just interesting.</description>
      <pubDate>Wed, 24 Jun 2009 11:59:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Finding-the-best-subset-of-a-given-size/m-p/47504#M2075</guid>
      <dc:creator>SteveDenham</dc:creator>
      <dc:date>2009-06-24T11:59:09Z</dc:date>
    </item>
  </channel>
</rss>

