I'm trying to use proc surveyselect to conduct a random drawing. Entrants can have multiple entries, and I have a table that lists each entry. The problem is I'm getting duplicate winners because of the multiple entries. Is there a way to exclude duplicate observations by a certain column (Entrant_ID in this case)?
Consider this data set, but in actuality I have 100 dept names, and about 3,000 entrant IDs (most of which are duplicates. There are really about 90 unique entrant IDs per Dept Name. Entrant ID's will not duplicate over different dept's). Also, I will have a different actual sampsize for each dept, but I understand there is a way to make that reference a separate table, so I will be looking to do that.
Dept_Name | Entrant_ID |
Dept 1 | JRTHAL |
Dept 1 | TLSMIT |
Dept 1 | JRTHAL |
Dept 1 | VLNEW |
Dept 2 | MRREL |
Dept 2 | MNJON |
Dept 2 | MRREL |
Dept 2 | MNJON |
Dept 2 | NWCON |
Dept 3 | JRMCC |
Dept 3 | ADLON |
Dept 3 | JRMCC |
Dept 3 | ADLON |
Dept 3 | BFPIT |
Dept 3 | BFPIT |
Code I am using:
proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=10;
strata dept_name;
run;
Also, I'm not tied to using surveyselect, and I have many ways to manipulate the data if you have suggestions.
Thanks!
Lets discuss the easy bit first. When you have stratified data, such as your Dept_name variable, you can specify a separate samplesize for each strata directly, though with 100 departments it may be easier to use Samprate, such as .1 to get 10% of each strata. If you want to specify a specific sample size for each strata you have two basic approaches. Use the Sampsize option with a list. Coonsider this code:
proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize= (2 1 4); strata dept_name; run;
IF there are exactly 3 strata values, then the first strata in the sort order would select 2 records, the second would select 1 and the third strata would have 4 records.
With 100 or so strata that gets a bit ugly to type on a line of code. You can use the Sampsize=datasetname to provide a data set with the strata variables and the number of records to select from each strata. The desired sample size variable must be named one of _nsize_ or SampleSize.
data deptsampsize; input dept_name $ _nsize_; datalines; Dept1 2 Dept2 1 Dept3 4 ; proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=deptsampsize; strata dept_name; run;
A similar approach is available if you want different sampling rates for each strata. If using the SAMPRATE=dataset option the variable name holding the desired sample rate is named _rate_ and can be range of 1-100 (treated 1% to 100%) or 0.01 to 1.
Now for a more complicated bit of how to treat multiple entries. Do you want each person to have a better chance of selection if they have more entries? Then you might consider summarizing your data to get a count of entries per Dept Entrant and use that as the input data set to Proc suryvey select indicating that count a Size variable and use one of the selection methods like PPS.
Easy to create such as summary set for use.
Proc summary data=have nway; class dept_name entrant_id; output out=entrysummary (drop=_type_); run; /* will have a variable _freq_ that counts how many time the combination appears*/ Proc surveyselect data=entrysummary method=PPS samprate= .1 out=all_winners; strata Dept_name; size _freq_; run;
There are multiple PPS methods, read the documentation to see if one of them fits your needs better.
There are ways around the summary step for some data but I find this clearer to use and I think for your purpose it will be much cleaner output. You might consider providing the SEED option on the Proc statement. That way you can recreate the same output given the same input when using the same SAS version (no guarantee of same result in different versions and in some cases different computers).
It sounds like you want random sample without replacement.
First you need to remove any duplicate Entrant_ID values. Then, in PROC SURVEYSELECT use METHOD=SRS.
There should not be a need to use a different data set for each department, use the STRATA statement to have each department sampled properly.
So let's say for a certain department, person A has two entries, and person B and C each have 1 entry. This means that person A should have a 50% chance of getting picked, while B and C each have a 25% chance of getting picked. So, you do sampling without replacement on a data set where person A appears once but has a 50% chance of getting picked. And after person A gets picked, he shouldn't get picked again, which happens with sampling without replacement. This is METHOD=PPS (which stands for Probability Proportional to Size).
Lets discuss the easy bit first. When you have stratified data, such as your Dept_name variable, you can specify a separate samplesize for each strata directly, though with 100 departments it may be easier to use Samprate, such as .1 to get 10% of each strata. If you want to specify a specific sample size for each strata you have two basic approaches. Use the Sampsize option with a list. Coonsider this code:
proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize= (2 1 4); strata dept_name; run;
IF there are exactly 3 strata values, then the first strata in the sort order would select 2 records, the second would select 1 and the third strata would have 4 records.
With 100 or so strata that gets a bit ugly to type on a line of code. You can use the Sampsize=datasetname to provide a data set with the strata variables and the number of records to select from each strata. The desired sample size variable must be named one of _nsize_ or SampleSize.
data deptsampsize; input dept_name $ _nsize_; datalines; Dept1 2 Dept2 1 Dept3 4 ; proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=deptsampsize; strata dept_name; run;
A similar approach is available if you want different sampling rates for each strata. If using the SAMPRATE=dataset option the variable name holding the desired sample rate is named _rate_ and can be range of 1-100 (treated 1% to 100%) or 0.01 to 1.
Now for a more complicated bit of how to treat multiple entries. Do you want each person to have a better chance of selection if they have more entries? Then you might consider summarizing your data to get a count of entries per Dept Entrant and use that as the input data set to Proc suryvey select indicating that count a Size variable and use one of the selection methods like PPS.
Easy to create such as summary set for use.
Proc summary data=have nway; class dept_name entrant_id; output out=entrysummary (drop=_type_); run; /* will have a variable _freq_ that counts how many time the combination appears*/ Proc surveyselect data=entrysummary method=PPS samprate= .1 out=all_winners; strata Dept_name; size _freq_; run;
There are multiple PPS methods, read the documentation to see if one of them fits your needs better.
There are ways around the summary step for some data but I find this clearer to use and I think for your purpose it will be much cleaner output. You might consider providing the SEED option on the Proc statement. That way you can recreate the same output given the same input when using the same SAS version (no guarantee of same result in different versions and in some cases different computers).
I need to keep the duplicates in the file that I do the survey select on because those are entries in the drawing. It gives them a better chance of getting picked. What I'm trying to figure out is how to make it non select an entrant after they have already been selected (based on entrant_id).
@Jay_210 wrote:
I need to keep the duplicates in the file that I do the survey select on because those are entries in the drawing. It gives them a better chance of getting picked. What I'm trying to figure out is how to make it non select an entrant after they have already been selected (based on entrant_id).
Have you read my post about PPS and summary data to select?
To select entrants (unique entrants) in proportion to their number of entries, you can use PROC SURVEYSELECT with the PPS option in the CLUSTER statement, like this:
CLUSTER entrant_id / PPS;
Documentation is here. The PPS option in the CLUSTER statement automatically computes the cluster (entrant_id) size measure as the number of observations (entries) in the cluster.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.