Jeff_DOC
Pyrite | Level 9

I need to generate a random sample from a given population, and I am using proc surveyselect. The code that generates the sample runs at exactly the same time each week. If I do not specify a seed option the computer clock is used, but since the timing is so close on each run the results have far too much overlap week to week. Each observation has several numeric variables and I was thinking of using one of those as a "seed", but I'm not sure if this is the best way.

Would anyone have a bit of expert advice on a good way to move forward?

Thanks very much.

1 ACCEPTED SOLUTION

PGStats
Opal | Level 21

The probability that two samples have at least one observation in common might be higher than you think. It is just like the classical exercise of finding whether any two students in a class share a birthday: the probability is much higher than most people think intuitively. With 14000 observations and a random sampling rate of 0.00471, i.e. a sample size of about 66, the probability that two consecutive samples have NO observation in common is only about 0.73. That leaves plenty of room for duplicates to occur.

PG
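As a rough check of the 0.73 figure above, here is a minimal Python sketch (Python rather than SAS, purely for the arithmetic). It assumes two independent simple random samples drawn without replacement from the same 14,000 observations; the variable names are illustrative and not taken from the original program.

```python
from math import comb

N = 14000            # population size quoted in the thread
rate = 0.00471       # sampling rate from the PROC SURVEYSELECT call
n = round(N * rate)  # sample size, about 66 observations

# Approximation used in the post: each of the n new picks independently
# misses the previous sample with probability (1 - rate).
p_approx = (1 - rate) ** n

# Exact value under the stated assumption: the second sample must fall
# entirely within the N - n observations not chosen the first time.
p_exact = comb(N - n, n) / comb(N, n)

print(f"n = {n}")
print(f"approximate P(no overlap) = {p_approx:.3f}")
print(f"exact P(no overlap)       = {p_exact:.3f}")
```

Both values come out close to 0.73, so under these assumptions roughly one pair of weekly samples in four would be expected to share at least one observation.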


8 REPLIES 8
data_null__
Jade | Level 19

I don't think you have a problem. The time-based seed is derived from the number of seconds since 01Jan1960:00:00:00, not the actual time of day the way you think of it.
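To illustrate the point, here is a small Python sketch that computes seconds since 01Jan1960 for two runs at the same time of day, one week apart. The run dates are hypothetical, and this does not reproduce how SAS actually turns the clock reading into a seed; it only shows that the underlying clock value differs even when the time of day is identical.

```python
from datetime import datetime

SAS_EPOCH = datetime(1960, 1, 1)  # reference point named in the reply above

def seconds_since_1960(ts: datetime) -> int:
    """Whole seconds elapsed between 01Jan1960:00:00:00 and ts."""
    return int((ts - SAS_EPOCH).total_seconds())

# Hypothetical weekly runs at exactly the same time of day.
run_a = datetime(2024, 1, 1, 6, 0, 0)
run_b = datetime(2024, 1, 8, 6, 0, 0)

value_a = seconds_since_1960(run_a)
value_b = seconds_since_1960(run_b)

print(value_a, value_b, value_b - value_a)
```

The two values differ by 604800, the number of seconds in a week, so the clock reading is never the same from one weekly run to the next.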

SteveDenham
Jade | Level 19

I'll echo data_null_, kind of. I don't think you have the problem you think you have. I think you have another sort of problem altogether. If you are getting the same values out every time, then the most likely explanation is that you are oversampling in some sense: either selecting a very large percentage of the population at hand, or the observations have a LOT of identical values. I would bet on the first without more information.

Steve Denham

Jeff_DOC
Pyrite | Level 9

Thank you both for the assist. The total population is just over 14,000 and I am using surveyselect to sample at a rate of .00471 (about 0.471%).

The same observations seem to come up pretty frequently. Almost as if surveyselect assigns a random number and then begins selection at the same place it did on the previous run.

Should I be sorting the data prior to selection?

proc surveyselect data = incoming out = outgoing
     method = srs
     rate = .00471;
     id var1 var2 var3;
run;

Tom
Super User

If you are worried about it use the OUTSEED option so that SurveySelect will write the seed value that it used into the output dataset.  Then you can analyze the seeds and see how similar they are.

Haikuo
Onyx | Level 15

PG,

I don't understand why "two consecutive samples"? Given this context, from my imagination (which is often wrong), the probability of sharing no common elements between any two independent samples (consecutive or not) should always be 0.73? Maybe I misinterpret what you mean by 'consecutive'.

Regards,

Haikuo 

PGStats
Opal | Level 21

True, the probability is the same for any two samples. But the ones you are most likely to notice will be the consecutive ones, just like winning twice at the lottery. If you get two completely distinct samples, the probability that a third one is distinct from the first two drops to (1-0.00471)^(66+66) = 0.54, and so on. My point is simply that our intuition is geared toward noticing and overreacting to coincidences, and that when Jeff says "The same observations seem to come up pretty frequently", it might be that sort of thing, which happens to all of us.

PG
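The "and so on" above can be sketched in Python using the same approximation as the post, (1 - rate)^(n * k) for k earlier, mutually distinct samples. The function name is made up for illustration, and the formula is the independent-picks approximation rather than an exact hypergeometric calculation.

```python
rate = 0.00471  # sampling rate from the thread
n = 66          # approximate sample size

def p_new_sample_distinct(k_previous: int) -> float:
    """Approximate chance that a new sample of size n shares nothing with
    k_previous earlier samples that were themselves mutually distinct."""
    return (1 - rate) ** (n * k_previous)

# k = 1 is the two-sample case; k = 2 is the "third sample" case in the post.
for k in range(1, 6):
    print(k, round(p_new_sample_distinct(k), 2))
```

With k = 1 this gives about 0.73 and with k = 2 about 0.54, matching the figures quoted in the thread; the probability keeps shrinking as more weeks accumulate.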

Jeff_DOC
Pyrite | Level 9

First, thank you all so much for all your replies and helpful comments.

We randomly select .00471 from a population of approx 14,000 each week. The program is automated and runs at the exact same time of day each week. Initially I had thought (since the program runs at the exact same time of day each week) that I had an issue with random number generation based on that time. According to data_null_ I do not have that issue, so that's a relief.

There are new additions to the 14,000 and some drop-offs, but not a whole lot week to week. Therefore, between any two samples the population is basically the same but does change. I think (according to most responses here) I have a perception problem: because a particular person may show up 3 times in 6 months, it seems like erroneous repeated selection when in reality it could simply be "luck as falls". The population is not stratified, so selection is completely across the entire population.

Do you think this pretty well sums up the general opinion of my issue?

Thanks again.
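The "3 times in 6 months" observation can be checked roughly with a binomial calculation. This Python sketch assumes 26 independent weekly draws at rate .00471 from a population that stays fixed at 14,000; the real population changes a little each week, so treat the numbers as ballpark only. The helper name is illustrative.

```python
from math import comb

N = 14000       # approximate population size from the thread
rate = 0.00471  # weekly sampling rate
weeks = 26      # assumption: "6 months" taken as 26 weekly runs

def p_at_least(k: int, trials: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(trials, p)."""
    below_k = sum(comb(trials, i) * p**i * (1 - p) ** (trials - i) for i in range(k))
    return 1 - below_k

# Chance that one specific person is selected 3 or more times in 26 weeks.
p_person = p_at_least(3, weeks, rate)

# Expected number of people, out of N, who are selected 3 or more times.
expected_people = N * p_person

print(f"P(a given person is picked 3+ times) = {p_person:.6f}")
print(f"expected number of such people       = {expected_people:.1f}")
```

For any one person the chance is tiny, but across 14,000 people the expected number with three or more selections in six months is a handful, so seeing a few such cases is consistent with plain random selection under these assumptions.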

