What criteria are you going to use to rank the subsets? Is it critical to determine the absolute best on that scale, or would any of, say, 100 subsets that were approximately equal be sufficient?
And you are absolutely right that the binomial coefficient C(158 792, 5 000) is a very big number--you are in the cosmological realm with that. And that is what makes me wonder why you would need the "best" subset, rather than a properly stratified random sample.
Each of the seven variables can take any integer value from 1 to 5, so there are 5^7 = 78 125 possible different combinations.
I generate a SAS table with all the possible combinations. That's easy to do with SAS code.
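For readers following along outside SAS, a minimal sketch of generating that full table of combinations (in Python rather than SAS; in SAS this would be seven nested DO loops writing to an output data set):

```python
from itertools import product

# All 5**7 = 78 125 combinations of seven variables, each taking values 1-5.
all_combos = list(product(range(1, 6), repeat=7))

print(len(all_combos))   # 78125
print(all_combos[0])     # (1, 1, 1, 1, 1, 1, 1)
```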
Then I would like to know whether a subset of, for instance, 900 of the 78 125 records would contain at least one record whose values agree on at least 5 of the 7 variables with any of the 78 125 records I check it against.
If it were even possible to find the best such 900-record subset (the one with the largest number of records having 5 equal values, or 6, if that is achievable), that would be better still.
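To make the question concrete, here is a sketch (Python, not SAS; the function names are my own) of the "at least 5 of 7 values equal" check, and a possible figure of merit for a subset built on top of it. Since a record always agrees with itself on all 7 positions, the interesting comparisons are between distinct records:

```python
def n_equal(a, b):
    """Number of positions (out of 7) where the two records agree."""
    return sum(x == y for x, y in zip(a, b))

r1 = (1, 2, 3, 4, 5, 1, 2)
r2 = (1, 2, 3, 4, 5, 2, 1)   # agrees on the first 5 positions
print(n_equal(r1, r2))        # 5

def covered(subset, population, k=5):
    """How many population records have a near-match (>= k of 7 equal)
    to some *distinct* record in the subset."""
    return sum(
        any(rec != s and n_equal(rec, s) >= k for s in subset)
        for rec in population
    )

# Tiny demonstration on three hand-picked records.
print(covered([(1,) * 7], [(1,) * 7, (1, 1, 1, 1, 1, 1, 2), (2,) * 7]))  # 1
```

Evaluating `covered` for a 900-record subset against all 78 125 records is a double loop of about 70 million pair comparisons, which is exactly the cost the answers below reason about.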
Are any of these things possible to find out with SAS (in a lifetime)?
Even the smaller problem is not tractable if we are to examine all possible combinations. Let's assume you have access to a petaflop-speed machine, and that it takes 10 floating-point operations to characterize a single combination. That means we could evaluate 10^14 combinations in one second. Using Ramanujan's approximation for log n!, C(78 125, 900) comes to about 10^2131 subsets, which works out to something like 10^2110 years to generate all possible 900-record subsets. So let's not look at all of them, it is just too daunting.
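The magnitude is easy to sanity-check with log-gamma (a Python sketch; the helper name is mine):

```python
from math import lgamma, log

def log10_choose(n, k):
    """log10 of the binomial coefficient C(n, k), via log-gamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)

subsets = log10_choose(78125, 900)        # number of 900-subsets, in decimal digits
seconds = subsets - 14                    # at 10**14 subsets per second
years   = seconds - log(3.156e7, 10)      # ~3.156e7 seconds per year
print(round(subsets, 1), round(years, 1)) # roughly 2131 and 2110
```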
Instead, let's see how many subsets we would have to look at to have a 50% chance of finding one that has at least 5 equal values. First, using a hypergeometric distribution, the probability that any single subset of 900 has at least 5 equal values is 4.13*10^-9. Now treat successive subsets as independent draws (a geometric distribution): a true 50% chance needs about 168 million subsets (ln 2 / p), while the cruder shortcut of asking for an expected half success (n*p = 1/2) gives about 121 million. Either way, that might be doable in SAS in a reasonable amount of time.
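With p fixed at the value above, both trial counts follow from one line of arithmetic each (Python sketch):

```python
from math import log, log1p

p = 4.13e-9              # P(a random 900-subset has at least 5 equal values)

# Exact geometric calculation: smallest n with 1 - (1-p)**n >= 0.5.
n_half = log(0.5) / log1p(-p)
print(f"{n_half:.3e}")   # about 1.68e8, i.e. ~168 million subsets

# Cruder shortcut: expected number of successes n*p = 1/2.
print(f"{0.5 / p:.3e}")  # about 1.21e8, i.e. ~121 million subsets
```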
The original problem is still intractable. The probability of at least 5 equal values in that situation is on the order of 10^-40, so we are looking at about 2.6*10^40 subsets to have a roughly 50% chance of finding one. On the petaflop machine, that is still about 2.6*10^26 seconds, or roughly 8*10^18 years.
Well, since combinatorial search is out, a Monte Carlo approach could be adopted once a "how good is it" function is specified (e.g. percentile matching).
As an example: start with a random subset of the specified size; pick a member and replace it with a randomly selected record from the population, keeping the swap only if the figure of merit improves, allowing a pre-specified number of trials. If members can be evaluated individually, replace the worst one. Stop when k successive attempts have failed to improve. Restart with a new subset, etc. Keep the best one.
This would be my preference as well. The algorithm could also prune branches that are obvious dead ends, as the exact algorithms in FREQ, MULTTEST and LOGISTIC do.
But this bangs up against theoretical computer science in an interesting way. We have traded a very tedious but finite exhaustive search for a randomized process with no guaranteed termination bound (a classical stopping problem). That doesn't mean it will take longer--it's just interesting.