Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Finding the best subset of a given size.

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 06-22-2009 02:58 AM
(1457 views)

Is there any method/procedure in SAS to use, when you want to find the subset(s) that best describes the original set?

I have a table with 7 numerical variables. The number of records are 158 792.

I want to find the subset(s) of size 5 000 records that can be regarded as the best sample(s) of that particular size.

How can SAS be used to solve this?

Binomial(158 792, 5 000 ) is a very big number.

Anne

I have a table with 7 numerical variables. The number of records are 158 792.

I want to find the subset(s) of size 5 000 records that can be regarded as the best sample(s) of that particular size.

How can SAS be used to solve this?

Binomial(158 792, 5 000 ) is a very big number.

Anne

5 REPLIES 5

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Anne,

What criteria are you going to use to decide a ranking of the subsets? Is it critical to determine the absolute best on that scale, or would any of, say, 100 subsets that were approximately equal be sufficient?

And you are absolutely right that Binomial(158 792, 5 000 ) is a very big number--you are in the cosmological realm with that. And that is what makes me wonder why you would need the "best" subset, rather than a properly stratified random sample.

Good luck.

Steve Denham

What criteria are you going to use to decide a ranking of the subsets? Is it critical to determine the absolute best on that scale, or would any of, say, 100 subsets that were approximately equal be sufficient?

And you are absolutely right that Binomial(158 792, 5 000 ) is a very big number--you are in the cosmological realm with that. And that is what makes me wonder why you would need the "best" subset, rather than a properly stratified random sample.

Good luck.

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Steve

If I put it this way:

The seven variables can have the values of any of the integers 1-5. Then there are 5^7=78 125 possible**different** combinations.

I generate a SAS table with all the possible combinations. That's easy to do with SAS code.

Then I would like to know if in a subset, for instance with 900 of the 78 125 records, there would be at least one record where the values would be equal for at least 5 of the 7 variables, if I checked it against**any of the 78 125 records**.

If it was possible to even find the**best**(=with the biggest number of records with 5 values equal, or 6 values, if that was possible) **combination** of a 900 record subset, that would even be better.

Are any of these things possible to find out with SAS ( in a lifetime )?

Anne Message was edited by: AnneP

If I put it this way:

The seven variables can have the values of any of the integers 1-5. Then there are 5^7=78 125 possible

I generate a SAS table with all the possible combinations. That's easy to do with SAS code.

Then I would like to know if in a subset, for instance with 900 of the 78 125 records, there would be at least one record where the values would be equal for at least 5 of the 7 variables, if I checked it against

If it was possible to even find the

Are any of these things possible to find out with SAS ( in a lifetime )?

Anne Message was edited by: AnneP

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Even the smaller problem is not tractable, if we are to examine all possible combinations. Let's assume that you have access to a petaflop speed machine, and it takes 10 floating point operations to characterize a single combination. That means we could generate 10^14 combinations in one second. Using Ramanujan's approximation for log n!, I get something like 10^2106 years to generate all possible 900 record subsets. So let's not look at all possibles, it is just too daunting.

Instead, let's see how many subsets we would have to look at to have a 50% chance of finding one that has at least 5 equal values. First, using a hypergeometric distribution, the probability that any single subset of 900 has at least 5 equal values is 4.13*10^-9. Now we consider a geometric distribution, apply some approximations, and see that we need to generate about 121 million successive subsets to have a roughly 50% chance of finding one that has at least five equal values. That might be doable in SAS in a reasonable amount of time

The original problem is still intractable. The probability of at least 5 equal values in this situation is on the order of 10^-40, so we are looking at about 2.6*10^40 subsets to have a roughly 50% chance of finding one. On the petaflop machine, this is still 10^25 seconds or 3.2*10^17 years.

Instead, let's see how many subsets we would have to look at to have a 50% chance of finding one that has at least 5 equal values. First, using a hypergeometric distribution, the probability that any single subset of 900 has at least 5 equal values is 4.13*10^-9. Now we consider a geometric distribution, apply some approximations, and see that we need to generate about 121 million successive subsets to have a roughly 50% chance of finding one that has at least five equal values. That might be doable in SAS in a reasonable amount of time

The original problem is still intractable. The probability of at least 5 equal values in this situation is on the order of 10^-40, so we are looking at about 2.6*10^40 subsets to have a roughly 50% chance of finding one. On the petaflop machine, this is still 10^25 seconds or 3.2*10^17 years.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Well, since combinatorial search is out, a Monte Carlo approach could be adopted once a "how good is it" function is specified (e.g. percentile matching).

As an example start with a random subset of the specified size, pick a member and replace with a randomly selected value from the population if the figure of merit improves, allowing a pre-specified number of trials. If a member can be singly evaluated pick the worst one. Stop when*k* members have failed. Restart with a new subset, etc. Keep the best one.

As an example start with a random subset of the specified size, pick a member and replace with a randomly selected value from the population if the figure of merit improves, allowing a pre-specified number of trials. If a member can be singly evaluated pick the worst one. Stop when

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

This would be my preference as well. The algorithm could also prune branches that are obvious dead ends as well, as in the exact algorithms in FREQ, MULTTEST and LOGISTIC.

But this bangs up against theoretical computer science, in an interesting way. We have traded a very tedious polynomial time process for a non-polynomial time one (a classical stopping problem). That doesn't mean it will take longer--it's just interesting.

But this bangs up against theoretical computer science, in an interesting way. We have traded a very tedious polynomial time process for a non-polynomial time one (a classical stopping problem). That doesn't mean it will take longer--it's just interesting.

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.