Random Observation Selection-Proc Surveyselect

05-31-2012 07:23 PM

I need to generate a random sample from a given population, and I am using proc surveyselect. The code that generates the sample runs at exactly the same time each week. If I do not specify a seed option, the computer clock is used, but since the timing is so close on each run, the results have far too much overlap week to week. Each observation has several numeric variables, and I was thinking of using one of those as a "seed", but I'm not sure if this is the best way.

Would anyone have a bit of expert advice on a good way to move forward?

Thanks very much.

Accepted Solutions

Solution

Posted in reply to Jeff_DOC

06-02-2012 11:20 PM

The probability that two samples share at least one observation might be higher than you think. It is like the classical exercise of finding whether any two students in a class have the same birthday: the probability is much higher than most people think intuitively. With 14000 observations and a random sampling rate of 0.00471, i.e. a sample size of about 66, the probability that two consecutive samples have NO observation in common is only about (1-0.00471)^66 = 0.73. That leaves plenty of room for duplicates to occur.

PG
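As a quick check of the 0.73 figure above, here is a minimal sketch in plain Python (used only for the arithmetic, not SAS). It assumes two independent simple random samples without replacement from a fixed population of 14000, and compares the approximation from the post with the exact hypergeometric value.

```python
from math import comb

N = 14000            # population size quoted in the thread
rate = 0.00471       # sampling rate from the PROC SURVEYSELECT call
n = round(N * rate)  # sample size, about 66

# Approximation used in the post: each of the n rows in the second sample
# independently misses the first sample with probability 1 - rate.
p_approx = (1 - rate) ** n

# Exact value for two independent simple random samples without replacement:
# all n rows of the second sample must fall among the N - n unsampled rows.
p_exact = comb(N - n, n) / comb(N, n)

print(f"n = {n}")
print(f"approximate P(no overlap) = {p_approx:.4f}")
print(f"exact P(no overlap)       = {p_exact:.4f}")
```

Both values round to about 0.73, so roughly one pair of weekly samples in four would be expected to share at least one observation under these assumptions.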

All Replies

Posted in reply to Jeff_DOC

05-31-2012 08:14 PM

I don't think you have a problem. The time-based seed is derived from the number of seconds since 01Jan1960:00:00:00, not from the time of day the way you think of it.

Posted in reply to data_null__

06-01-2012 09:05 AM

I'll echo data_null_, kind of. I don't think you have the problem you think you have; I think you have another sort of problem altogether. If you are getting the same values out every time, then the most likely explanation is that you are oversampling in some sense: either selecting a very large percentage of the population at hand, or the observations have a LOT of identical values. Without more information, I would bet on the first.

Steve Denham

Posted in reply to Jeff_DOC

06-01-2012 05:52 PM

Thank you both for the assist. The total population is just over 14,000 and I am using surveyselect to sample at a rate of .00471.

The same observations seem to come up pretty frequently. Almost as if surveyselect assigns a random number and then begins selection at the same place it did on the previous run.

Should I be sorting the data prior to selection?

proc surveyselect data = incoming out = outgoing
    method = srs
    rate = .00471;
    id var1 var2 var3;
run;
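To get a feel for how much overlap independent simple random samples produce on their own, here is a small simulation sketch in plain Python (not SAS, and not a model of how PROC SURVEYSELECT works internally). It assumes a fixed population of 14000 row indices and a sample size of 66; `TRIALS` and the fixed seed are arbitrary choices made for illustration. In simple random sampling every row is equally likely to be picked, so the sort order of the rows plays no role in this sketch.

```python
import random

N = 14000      # approximate population size from the thread
n = 66         # about N * 0.00471
TRIALS = 5000  # hypothetical number of simulated pairs of weekly samples

rng = random.Random(2012)  # fixed seed only so the sketch is reproducible

no_overlap = 0
total_shared = 0
for _ in range(TRIALS):
    # Two independent simple random samples without replacement.
    week1 = set(rng.sample(range(N), n))
    week2 = set(rng.sample(range(N), n))
    shared = len(week1 & week2)
    total_shared += shared
    if shared == 0:
        no_overlap += 1

frac_no_overlap = no_overlap / TRIALS
mean_shared = total_shared / TRIALS

print(f"fraction of pairs with no shared rows: {frac_no_overlap:.3f}")
print(f"average number of shared rows per pair: {mean_shared:.3f}")
```

Comparing the real week-to-week overlap against numbers like these is one way to judge whether the observed repeats are more frequent than chance alone would give.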

Posted in reply to Jeff_DOC

06-02-2012 11:00 AM

If you are worried about it, use the OUTSEED option so that SurveySelect writes the seed value it used into the output dataset. Then you can analyze the seeds and see how similar they are.

Posted in reply to PGStats

06-03-2012 01:59 PM

PG,

I don't understand why "two consecutive samples"? Given this context, from my imagination (which is often wrong), the probability that any two independent samples (consecutive or not) share no common elements should always be 0.73? Maybe I misinterpret what you mean by 'consecutive'.

Regards,

Haikuo

Posted in reply to Haikuo

06-03-2012 03:09 PM

True, the probability is the same for any two samples. But the ones you are most likely to notice will be the consecutive ones, just like winning the lottery twice. If you get two completely distinct samples, the probability that a third one is distinct from the first two drops to (1-0.00471)^(66+66) = 0.54, and so on. My point is simply that our intuition is geared toward noticing and overreacting to coincidences, and that when Jeff says "The same observations seem to come up pretty frequently", it might be the sort of thing that happens to all of us.

PG
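The "and so on" above can be sketched in a few lines of plain Python (arithmetic only, not SAS). The helper name `p_next_sample_distinct` is hypothetical, and the sketch assumes the same approximation as the post: each of the rows already selected has probability 0.00471 of being picked again in the next sample.

```python
rate = 0.00471  # sampling rate from the thread
n = 66          # approximate weekly sample size from the thread

def p_next_sample_distinct(k_previous: int) -> float:
    """Approximate probability that a new sample avoids every row in
    k_previous earlier samples that were themselves mutually distinct."""
    rows_already_seen = n * k_previous
    return (1 - rate) ** rows_already_seen

for k in range(1, 5):
    print(f"after {k} distinct sample(s): {p_next_sample_distinct(k):.2f}")
```

For one and two earlier samples this reproduces the 0.73 and 0.54 quoted in the thread, and the value keeps shrinking as more weeks accumulate.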

Posted in reply to Jeff_DOC

06-04-2012 02:10 PM

First, thank you all so much for all your replies and helpful comments.

We randomly select .00471 of a population of approx. 14,000 each week. The program is automated and runs at the exact same time of day each week. Initially I had thought (since the program runs at the exact same time of day each week) that I had an issue with random number generation based on that time. According to data_null_ I do not have that issue, so that's a relief.

There are new additions to the 14,000 and some drop-offs, but not a whole lot week to week. Therefore between any two samples the population is basically the same but does change. I think (according to most responses here) I have a perception problem: because a particular person may show up 3 times in 6 months, it seems like mistaken repeated selection when in reality it could simply be "luck as falls". The population is not stratified, so selection is completely across the entire population.

Do you think this pretty well sums up the general opinion of my issue?

Thanks again.
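The "3 times in 6 months" case can be put in rough numbers with a short sketch in plain Python (arithmetic only, not SAS). It rests on assumptions not stated in the thread: "6 months" is taken as 26 weekly runs, the population is treated as a fixed 14000 people, and each weekly draw is independent with every person having selection probability 0.00471. The helper name `p_selected_at_least` is hypothetical.

```python
from math import comb

N = 14000       # approximate population size from the thread
rate = 0.00471  # weekly sampling rate from the thread
weeks = 26      # assumption: "6 months" taken as 26 weekly runs

def p_selected_at_least(k: int, draws: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(draws, p)."""
    p_fewer = sum(comb(draws, j) * p**j * (1 - p) ** (draws - j) for j in range(k))
    return 1.0 - p_fewer

# Chance that one particular person is selected 3 or more times in 26 weeks.
p_three_plus = p_selected_at_least(3, weeks, rate)

# Expected number of people in the whole population with 3 or more selections.
expected_people = N * p_three_plus

print(f"P(a given person is picked 3+ times) = {p_three_plus:.6f}")
print(f"expected people picked 3+ times      = {expected_people:.1f}")
```

Under these assumptions the chance for any one named person is tiny, yet across the whole population a few people would be expected to show up three or more times in half a year, which fits the "perception problem" reading.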