BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Jay_210
Calcite | Level 5

I'm trying to use proc surveyselect to conduct a random drawing. Entrants can have multiple entries, and I have a table that lists each entry. The problem is I'm getting duplicate winners because of the multiple entries. Is there a way to exclude duplicate observations by a certain column (Entrant_ID in this case)? 

Consider this data set, but in actuality I have 100 dept names, and about 3,000 entrant IDs (most of which are duplicates. There are really about 90 unique entrant IDs per Dept Name. Entrant ID's will not duplicate over different dept's). Also, I will have a different actual sampsize for each dept, but I understand there is a way to make that reference a separate table, so I will be looking to do that. 

Dept_NameEntrant_ID
Dept 1JRTHAL
Dept 1TLSMIT
Dept 1JRTHAL
Dept 1VLNEW
Dept 2MRREL
Dept 2MNJON
Dept 2MRREL
Dept 2MNJON
Dept 2NWCON
Dept 3JRMCC
Dept 3ADLON
Dept 3JRMCC
Dept 3ADLON
Dept 3BFPIT
Dept 3BFPIT

Code I am using:

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=10;
strata dept_name;
run;

 

Also, I'm not tied to using surveyselect, and I have many ways to manipulate the data if you have suggestions. 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Lets discuss the easy bit first. When you have stratified data, such as your Dept_name variable, you can specify a separate samplesize for each strata directly, though with 100 departments it may be easier to use Samprate, such as .1 to get 10% of each strata. If you want to specify a specific sample size for each strata you have two basic approaches. Use the Sampsize option with a list. Coonsider this code:

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize= (2 1 4);
   strata dept_name;
run;

IF there are exactly 3 strata values, then the first strata in the sort order would select 2 records, the second would select 1 and the third strata would have 4 records.

With 100 or so strata that gets a bit ugly to type on a line of code. You can use the Sampsize=datasetname to provide a data set with the strata variables and the number of records to select from each strata. The desired sample size variable must be named one of _nsize_ or SampleSize.

data deptsampsize;
   input dept_name $ _nsize_;
datalines;
Dept1   2
Dept2   1
Dept3   4
;

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=deptsampsize;
strata dept_name;
run;

A similar approach is available if you want different sampling rates for each strata. If using the SAMPRATE=dataset option the variable name holding the desired sample rate is named _rate_ and can be range of 1-100 (treated 1% to 100%) or 0.01 to 1.

Now for a more complicated bit of how to treat multiple entries. Do you want each person to have a better chance of selection if they have more entries? Then you might consider summarizing your data to get a count of entries per Dept Entrant and use that as the input data set to Proc suryvey select indicating that count a Size variable and use one of the selection methods like PPS.

Easy to create such as summary set for use.

Proc summary data=have nway;
   class dept_name entrant_id;
   output out=entrysummary (drop=_type_);
run;
/* will have a variable _freq_ that counts how many time the combination appears*/

Proc surveyselect data=entrysummary method=PPS 
   samprate= .1 out=all_winners;
   strata Dept_name;
   size _freq_;
run;

There are multiple PPS methods, read the documentation to see if one of them fits your needs better.

 

There are ways around the summary step for some data but I find this clearer to use and I think for your purpose it will be much cleaner output. You might consider providing the SEED option on the Proc statement. That way you can recreate the same output given the same input when using the same SAS version (no guarantee of same result in different versions and in some cases different computers).

View solution in original post

11 REPLIES 11
PaigeMiller
Diamond | Level 26

It sounds like you want random sample without replacement.

 

First you need to remove any duplicate Entrant_ID values. Then, in PROC SURVEYSELECT use METHOD=SRS.

 

There should not be a need to use a different data set for each department, use the STRATA statement to have each department sampled properly.

--
Paige Miller
Jay_210
Calcite | Level 5
I can't remove the duplicate entrant ID values because that would eliminate their multiple entries. Each line in the data set is sort of like their "raffle ticket" so if I remove any it removes their multiple "raffle tickets." I'm already using SRS, but the "without replacement" feature is going line by line. I need it to go Entrant ID by Entrant ID. I hope that makes sense.
PaigeMiller
Diamond | Level 26

So let's say for a certain department, person A has two entries, and person B and C each have 1 entry. This means that person A should have a 50% chance of getting picked, while B and C each have a 25% chance of getting picked. So, you do sampling without replacement on a data set where person A appears once but has a 50% chance of getting picked. And after person A gets picked, he shouldn't get picked again, which happens with sampling without replacement. This is METHOD=PPS (which stands for Probability Proportional to Size).

 

See https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_surveyselect_syntax01.htm#statu...

--
Paige Miller
ballardw
Super User

Lets discuss the easy bit first. When you have stratified data, such as your Dept_name variable, you can specify a separate samplesize for each strata directly, though with 100 departments it may be easier to use Samprate, such as .1 to get 10% of each strata. If you want to specify a specific sample size for each strata you have two basic approaches. Use the Sampsize option with a list. Coonsider this code:

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize= (2 1 4);
   strata dept_name;
run;

IF there are exactly 3 strata values, then the first strata in the sort order would select 2 records, the second would select 1 and the third strata would have 4 records.

With 100 or so strata that gets a bit ugly to type on a line of code. You can use the Sampsize=datasetname to provide a data set with the strata variables and the number of records to select from each strata. The desired sample size variable must be named one of _nsize_ or SampleSize.

data deptsampsize;
   input dept_name $ _nsize_;
datalines;
Dept1   2
Dept2   1
Dept3   4
;

proc surveyselect data=work.random_draw_entries out=all_winners method=srs reps=1 sampsize=deptsampsize;
strata dept_name;
run;

A similar approach is available if you want different sampling rates for each strata. If using the SAMPRATE=dataset option the variable name holding the desired sample rate is named _rate_ and can be range of 1-100 (treated 1% to 100%) or 0.01 to 1.

Now for a more complicated bit of how to treat multiple entries. Do you want each person to have a better chance of selection if they have more entries? Then you might consider summarizing your data to get a count of entries per Dept Entrant and use that as the input data set to Proc suryvey select indicating that count a Size variable and use one of the selection methods like PPS.

Easy to create such as summary set for use.

Proc summary data=have nway;
   class dept_name entrant_id;
   output out=entrysummary (drop=_type_);
run;
/* will have a variable _freq_ that counts how many time the combination appears*/

Proc surveyselect data=entrysummary method=PPS 
   samprate= .1 out=all_winners;
   strata Dept_name;
   size _freq_;
run;

There are multiple PPS methods, read the documentation to see if one of them fits your needs better.

 

There are ways around the summary step for some data but I find this clearer to use and I think for your purpose it will be much cleaner output. You might consider providing the SEED option on the Proc statement. That way you can recreate the same output given the same input when using the same SAS version (no guarantee of same result in different versions and in some cases different computers).

Jay_210
Calcite | Level 5
When I run this with "samprate= .1" I get an error: ERROR: The SAMPRATE= option may not be specified with METHOD=PPS. If I change it to sampsize=10 I get an error: ERROR: For METHOD=PPS, the relative size of each sampling unit must not exceed (1/SAMPSIZE).

I do like your suggestion to use Sampsize=datasetname once I can figure this out.
Ksharp
Super User
You could use PROC SORT to remove these duplicated value ,
proc sort data=have out=want nodupkey;by Entrant_ID;run;
After that running PROC SURVEYSELECT .
Jay_210
Calcite | Level 5

I need to keep the duplicates in the file that I do the survey select on because those are entries in the drawing. It gives them a better chance of getting picked. What I'm trying to figure out is how to make it non select an entrant after they have already been selected (based on entrant_id). 

ballardw
Super User

@Jay_210 wrote:

I need to keep the duplicates in the file that I do the survey select on because those are entries in the drawing. It gives them a better chance of getting picked. What I'm trying to figure out is how to make it non select an entrant after they have already been selected (based on entrant_id). 


Have you read my post about PPS and summary data to select?

Jay_210
Calcite | Level 5
I have, yes. I was going to reply but I still need to look into the PPS options for better understanding. I do have a list of entrant_ids and the number of entries they have. I was under the impression that wouldn't work for PPS by strata, but if I was wrong, it might work for me.
Watts
SAS Employee

To select entrants (unique entrants) in proportion to their number of entries, you can use PROC SURVEYSELECT with the PPS option in the CLUSTER statement, like this:

CLUSTER entrant_id / PPS;

Documentation is here. The PPS option in the CLUSTER statement automatically computes the cluster (entrant_id) size measure as the number of observations (entries) in the cluster. 

Jay_210
Calcite | Level 5
This is interesting. I do get the same errors as I replied to ballardw involving samprate/sampsize though.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 3807 views
  • 5 likes
  • 5 in conversation