Solved: Random Drawing using Surveyselect - no duplicate winners

Jay_210 · Posted 01-20-2023 12:22 PM

I'm trying to use proc surveyselect to conduct a random drawing. Entrants can have multiple entries, and I have a table that lists each entry. The problem is I'm getting duplicate winners because of the multiple entries. Is there a way to exclude duplicate observations by a certain column (Entrant_ID in this case)?

Consider this data set, but in actuality I have 100 dept names, and about 3,000 entrant IDs (most of which are duplicates. There are really about 90 unique entrant IDs per Dept Name. Entrant ID's will not duplicate over different dept's). Also, I will have a different actual sampsize for each dept, but I understand there is a way to make that reference a separate table, so I will be looking to do that.

Dept_Name	Entrant_ID
Dept 1	JRTHAL
Dept 1	TLSMIT
Dept 1	JRTHAL
Dept 1	VLNEW
Dept 2	MRREL
Dept 2	MNJON
Dept 2	MRREL
Dept 2	MNJON
Dept 2	NWCON
Dept 3	JRMCC
Dept 3	ADLON
Dept 3	JRMCC
Dept 3	ADLON
Dept 3	BFPIT
Dept 3	BFPIT

Code I am using:

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=10;
strata dept_name;
run;

Also, I'm not tied to using surveyselect, and I have many ways to manipulate the data if you have suggestions.

Thanks!

ballardw · Posted 01-20-2023 02:05 PM

Lets discuss the easy bit first. When you have stratified data, such as your Dept_name variable, you can specify a separate samplesize for each strata directly, though with 100 departments it may be easier to use Samprate, such as .1 to get 10% of each strata. If you want to specify a specific sample size for each strata you have two basic approaches. Use the Sampsize option with a list. Coonsider this code:

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize= (2 1 4);
   strata dept_name;
run;

IF there are exactly 3 strata values, then the first strata in the sort order would select 2 records, the second would select 1 and the third strata would have 4 records.

With 100 or so strata that gets a bit ugly to type on a line of code. You can use the Sampsize=datasetname to provide a data set with the strata variables and the number of records to select from each strata. The desired sample size variable must be named one of _nsize_ or SampleSize.

data deptsampsize;
   input dept_name $ _nsize_;
datalines;
Dept1   2
Dept2   1
Dept3   4
;

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=deptsampsize;
strata dept_name;
run;

A similar approach is available if you want different sampling rates for each strata. If using the SAMPRATE=dataset option the variable name holding the desired sample rate is named _rate_ and can be range of 1-100 (treated 1% to 100%) or 0.01 to 1.

Now for a more complicated bit of how to treat multiple entries. Do you want each person to have a better chance of selection if they have more entries? Then you might consider summarizing your data to get a count of entries per Dept Entrant and use that as the input data set to Proc suryvey select indicating that count a Size variable and use one of the selection methods like PPS.

Easy to create such as summary set for use.

Proc summary data=have nway;
   class dept_name entrant_id;
   output out=entrysummary (drop=_type_);
run;
/* will have a variable _freq_ that counts how many time the combination appears*/

Proc surveyselect data=entrysummary method=PPS 
   samprate= .1 out=all_winners;
   strata Dept_name;
   size _freq_;
run;

There are multiple PPS methods, read the documentation to see if one of them fits your needs better.

There are ways around the summary step for some data but I find this clearer to use and I think for your purpose it will be much cleaner output. You might consider providing the SEED option on the Proc statement. That way you can recreate the same output given the same input when using the same SAS version (no guarantee of same result in different versions and in some cases different computers).

View solution in original post

PaigeMiller · Posted 01-20-2023 01:45 PM

It sounds like you want random sample without replacement.

First you need to remove any duplicate Entrant_ID values. Then, in PROC SURVEYSELECT use METHOD=SRS.

There should not be a need to use a different data set for each department, use the STRATA statement to have each department sampled properly.

--
Paige Miller

Jay_210 · Posted 01-20-2023 01:51 PM

I can't remove the duplicate entrant ID values because that would eliminate their multiple entries. Each line in the data set is sort of like their "raffle ticket" so if I remove any it removes their multiple "raffle tickets." I'm already using SRS, but the "without replacement" feature is going line by line. I need it to go Entrant ID by Entrant ID. I hope that makes sense.

PaigeMiller · Posted 01-20-2023 01:58 PM

So let's say for a certain department, person A has two entries, and person B and C each have 1 entry. This means that person A should have a 50% chance of getting picked, while B and C each have a 25% chance of getting picked. So, you do sampling without replacement on a data set where person A appears once but has a 50% chance of getting picked. And after person A gets picked, he shouldn't get picked again, which happens with sampling without replacement. This is METHOD=PPS (which stands for Probability Proportional to Size).

See https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_surveyselect_syntax01.htm#statu...

--
Paige Miller

ballardw · Posted 01-20-2023 02:05 PM

Lets discuss the easy bit first. When you have stratified data, such as your Dept_name variable, you can specify a separate samplesize for each strata directly, though with 100 departments it may be easier to use Samprate, such as .1 to get 10% of each strata. If you want to specify a specific sample size for each strata you have two basic approaches. Use the Sampsize option with a list. Coonsider this code:

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize= (2 1 4);
   strata dept_name;
run;

IF there are exactly 3 strata values, then the first strata in the sort order would select 2 records, the second would select 1 and the third strata would have 4 records.

With 100 or so strata that gets a bit ugly to type on a line of code. You can use the Sampsize=datasetname to provide a data set with the strata variables and the number of records to select from each strata. The desired sample size variable must be named one of _nsize_ or SampleSize.

data deptsampsize;
   input dept_name $ _nsize_;
datalines;
Dept1   2
Dept2   1
Dept3   4
;

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=deptsampsize;
strata dept_name;
run;

A similar approach is available if you want different sampling rates for each strata. If using the SAMPRATE=dataset option the variable name holding the desired sample rate is named _rate_ and can be range of 1-100 (treated 1% to 100%) or 0.01 to 1.

Now for a more complicated bit of how to treat multiple entries. Do you want each person to have a better chance of selection if they have more entries? Then you might consider summarizing your data to get a count of entries per Dept Entrant and use that as the input data set to Proc suryvey select indicating that count a Size variable and use one of the selection methods like PPS.

Easy to create such as summary set for use.

Proc summary data=have nway;
   class dept_name entrant_id;
   output out=entrysummary (drop=_type_);
run;
/* will have a variable _freq_ that counts how many time the combination appears*/

Proc surveyselect data=entrysummary method=PPS 
   samprate= .1 out=all_winners;
   strata Dept_name;
   size _freq_;
run;

There are multiple PPS methods, read the documentation to see if one of them fits your needs better.

There are ways around the summary step for some data but I find this clearer to use and I think for your purpose it will be much cleaner output. You might consider providing the SEED option on the Proc statement. That way you can recreate the same output given the same input when using the same SAS version (no guarantee of same result in different versions and in some cases different computers).

Jay_210 · Posted 01-23-2023 07:54 PM

When I run this with "samprate= .1" I get an error: ERROR: The SAMPRATE= option may not be specified with METHOD=PPS. If I change it to sampsize=10 I get an error: ERROR: For METHOD=PPS, the relative size of each sampling unit must not exceed (1/SAMPSIZE).

I do like your suggestion to use Sampsize=datasetname once I can figure this out.

Ksharp · Posted 01-21-2023 09:55 AM

You could use PROC SORT to remove these duplicated value ,
proc sort data=have out=want nodupkey;by Entrant_ID;run;
After that running PROC SURVEYSELECT .

Jay_210 · Posted 01-23-2023 01:34 PM

I need to keep the duplicates in the file that I do the survey select on because those are entries in the drawing. It gives them a better chance of getting picked. What I'm trying to figure out is how to make it non select an entrant after they have already been selected (based on entrant_id).

ballardw · Posted 01-23-2023 02:52 PM

@Jay_210 wrote:

I need to keep the duplicates in the file that I do the survey select on because those are entries in the drawing. It gives them a better chance of getting picked. What I'm trying to figure out is how to make it non select an entrant after they have already been selected (based on entrant_id).

Have you read my post about PPS and summary data to select?

Jay_210 · Posted 01-23-2023 03:26 PM

I have, yes. I was going to reply but I still need to look into the PPS options for better understanding. I do have a list of entrant_ids and the number of entries they have. I was under the impression that wouldn't work for PPS by strata, but if I was wrong, it might work for me.

Watts · Posted 01-23-2023 03:38 PM

To select entrants (unique entrants) in proportion to their number of entries, you can use PROC SURVEYSELECT with the PPS option in the CLUSTER statement, like this:

CLUSTER entrant_id / PPS;

Documentation is here. The PPS option in the CLUSTER statement automatically computes the cluster (entrant_id) size measure as the number of observations (entries) in the cluster.

Jay_210 · Posted 01-23-2023 07:55 PM

This is interesting. I do get the same errors as I replied to ballardw involving samprate/sampsize though.

Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Re: Random Drawing using Surveyselect - no duplicate winners

Classroom Training Available!