Random Observation Selection-Proc Surveyselect

Contributor
Posts: 37
I need to generate a random sample from a given population. I am using proc surveyselect. The code that generates the sample runs at the exact same time each week. If I do not specify a seed option, the computer clock is used, but since the timing is so close from run to run, the results have far too much overlap week to week. Each observation has several numeric variables and I was thinking of using one of those as a "seed", but I'm not sure if this is the best way.

Would anyone have a bit of expert advice on a good way to move forward?

Thanks very much.



All Replies
Respected Advisor
Posts: 3,777

Re: Random Observation Selection-Proc Surveyselect

I don't think you have a problem. The time-based seed is derived from the number of seconds since 01Jan1960:00:00:00, not from the time of day the way you think of it.

Respected Advisor
Posts: 2,655

Re: Random Observation Selection-Proc Surveyselect

I'll echo data_null_, kind of. I don't think you have the problem you think you have. I think you have another sort of problem altogether. If you are getting the same values out every time, then the most likely explanation is that you are oversampling in some sense: either selecting a very large percentage of the population at hand, or the observations have a LOT of identical values. I would bet on the first without more information.

Steve Denham

Contributor
Posts: 37

Re: Random Observation Selection-Proc Surveyselect

Thank you both for the assist. The total population is just over 14,000 and I am using surveyselect to sample at a rate of .00471 (about 0.471%).

The same observations seem to come up pretty frequently. Almost as if surveyselect assigns a random number and then begins selection at the same place it did on the previous run.

Should I be sorting the data prior to selection?

proc surveyselect data = incoming out = outgoing
     method = srs
     rate = .00471;
     id var1 var2 var3;
run;

Super User
Posts: 6,498

Re: Random Observation Selection-Proc Surveyselect

If you are worried about it, use the OUTSEED option so that SurveySelect will write the seed value that it used into the output dataset. Then you can analyze the seeds and see how similar they are.
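
As a rough sketch of that suggestion, OUTSEED can be added to the PROC SURVEYSELECT statement posted earlier in the thread. The data set and variable names are the ones from that post. The PROC CONTENTS and PROC PRINT steps are only an assumed way of finding which variable in the output holds the seed, and they are not part of the original program.

```sas
/* Same sampling step as before, with OUTSEED added so the   */
/* initial seed is written to the OUT= data set.             */
proc surveyselect data = incoming out = outgoing
     method = srs
     rate = .00471
     outseed;
     id var1 var2 var3;
run;

/* Inspect the output to find the variable holding the seed. */
proc contents data = outgoing;
run;

proc print data = outgoing(obs = 5);
run;
```

Saving that seed value along with the run date each week would give a simple record for checking that the seeds really do differ from run to run.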

Solution
‎06-02-2012 11:20 PM
Respected Advisor
Posts: 4,641

Re: Random Observation Selection-Proc Surveyselect

The probability that two samples have at least one observation in common might be higher than you think. It is just like the classical exercise of finding whether any two students in a class have the same birthday: the probability is much higher than most people think intuitively. With 14000 observations and a random sampling rate of 0.00471, i.e. a sample size of about 66, the probability that two consecutive samples have NO observation in common is only about 0.73. That leaves plenty of room for duplicates to occur.
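
The 0.73 figure can be checked with a short DATA step. This is only a sketch: it assumes a fixed population of exactly 14000, two independent simple random samples, and a sample size obtained by rounding 14000 * 0.00471 up to 66. The variable names are made up for the illustration.

```sas
data _null_;
   popsize  = 14000;
   rate     = 0.00471;
   sampsize = ceil(popsize * rate);      /* 66 for these values */

   /* Approximation: each of the 66 units in the second sample  */
   /* independently misses the first sample.                    */
   p_approx = (1 - rate) ** sampsize;

   /* Exact value for sampling without replacement:             */
   /* C(popsize - sampsize, sampsize) / C(popsize, sampsize),   */
   /* computed as a running product to avoid huge numbers.      */
   p_exact = 1;
   do i = 0 to sampsize - 1;
      p_exact = p_exact * (popsize - sampsize - i) / (popsize - i);
   end;

   put p_approx= 6.4 p_exact= 6.4;
run;
```

Both values should come out near 0.73, so about one pair of weekly samples in four is expected to share at least one observation.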

PG
Respected Advisor
Posts: 3,124

Re: Random Observation Selection-Proc Surveyselect

PG,

I don't understand why "two consecutive samples"? Given this context, as I imagine it (and I am often wrong), the probability of sharing no common elements between any two independent samples (consecutive or not) should always be 0.73? Maybe I misinterpret what you mean by 'consecutive'.

Regards,

Haikuo 

Respected Advisor
Posts: 4,641

Re: Random Observation Selection-Proc Surveyselect

True, the probability is the same for any two samples. But the ones you are most likely to notice will be the consecutive ones, just like winning the lottery twice. If you get two completely distinct samples, the probability that a third one is distinct from the first two drops to (1-0.00471)^(66+66) = 0.54, and so on. My point is simply that our intuition is geared toward noticing and overreacting to coincidences, and that when Jeff says "The same observations seem to come up pretty frequently", it might be the sort of thing that happens to all of us.
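
The "and so on" can be tabulated with the same approximation. The sketch below assumes the earlier samples are all mutually distinct and that each sample has 66 observations; the loop bound of 4 is arbitrary and the variable names are illustrative.

```sas
data _null_;
   rate     = 0.00471;
   sampsize = 66;

   /* prior = number of earlier, mutually distinct samples.     */
   /* p_distinct = chance the next sample avoids all of them.   */
   do prior = 1 to 4;
      p_distinct = (1 - rate) ** (prior * sampsize);
      put prior= p_distinct= 6.4;
   end;
run;
```

The first two rows reproduce the 0.73 and 0.54 quoted above, and the probability keeps shrinking with every additional week.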

PG
Contributor
Posts: 37

Re: Random Observation Selection-Proc Surveyselect

First, thank you all so much for all your replies and helpful comments.

We randomly select .00471 from a population of approx 14,000 each week. The program is automated and runs at the exact same time of day each week. Initially I had thought that, since the program runs at the exact same time of day each week, I had an issue with random number generation based on that time. According to data_null_ I do not have that issue, so that's a relief.

There are new additions to the 14,000 and some drop-offs, but not a whole lot week to week. Therefore, between any two samples the population is basically the same but does change. I think (according to most responses here) I have a perception problem: because a particular person may show up 3 times in 6 months, it seems like mistaken repeated selection when in reality it could simply be "luck as falls". The population is not stratified, so selection is completely random across the entire population.

Do you think this pretty well sums up the general opinion of my issue?

Thanks again.
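
One hedged way to put a number on the "3 times in 6 months" case is a binomial calculation. The sketch assumes 26 weekly samples, a fixed population of 14000, 66 selections per week, and independent draws from week to week. None of these figures beyond the population size and rate come from the actual program, and the variable names are illustrative.

```sas
data _null_;
   popsize  = 14000;
   sampsize = 66;
   weeks    = 26;

   /* Chance that one given person is picked in a single week.  */
   p_week = sampsize / popsize;

   /* Chance that this person is picked 3 or more times in 26   */
   /* weeks: 1 - P(at most 2 selections).                       */
   p_3plus = 1 - cdf('BINOMIAL', 2, p_week, weeks);

   /* Expected number of people in the population with 3 or     */
   /* more selections over the same period.                     */
   expected_people = popsize * p_3plus;

   put p_week= 8.6 p_3plus= 10.8 expected_people= 6.2;
run;
```

Under these assumptions the expected count works out to a few people per half year, so seeing someone selected three times is not, by itself, evidence of a seeding problem.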

