- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a client test group that differs based on store location and client spend group ( high, medium low).
I have a larger group of similar control clients created based on past history. I would like to select a random control client for each test client that shares the same store location and spend level.
example of tables:
test:
client id | store_number | spend_level |
1 | 30 | 1 |
2 | 5 | 1 |
3 | 10 | 2 |
4 | 8 | 2 |
5 | 8 | 3 |
I have a potential control group that looks like this but with many more clients, and not weighted the same by store and spend.
How do I create a control sample that has the same number of clients by store and spend.
So if Store 3 has 10 low spend, 20 high spend and 3 medium spend clients, and store 5 has 2 low spend, 1 high spend, and 8 medium spend clients, the control will have the same.
I'd like this sample to be random within these conditions.
thank you in advance!
Meredith
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can do that with PROC SURVEYSELECT by specifying a secondary input data set (SAMPSIZE=).
PROC SURVEYSELECT data=input_SAS_dataset_ToSampleFrom /* control clients */
SAMPSIZE=Secondary_Input_Data_Set
out=SampleSRSwithinStrata
method=srs ; /* srs = Simple Random Sampling */
strata store_number spend_level;
run;
You can provide stratum sample sizes in the _NSIZE_
(or SampleSize
) variable in the SAMPSIZE= data set.
The secondary input data set must contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the secondary data set as in the DATA= data set.
SAS/STAT(R) 14.1 User's Guide
The SURVEYSELECT Procedure
secondary input data set
http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_surveyselect...
You can get the required stratum sample sizes with a PROC FREQ on your client test group.
PROC FREQ; tables store_number * spend_level / out=count; run;
Good luck with this!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I forgot to say that, for the SURVEYSELECT solution, you should put a SEED= value that is a (arbitrary) strictly positive integer (>0). Otherwise your results are not reproducible.
From the online doc ...
SEED=number
specifies the initial seed for random number generation. The SEED= value must be a positive integer. If you do not specify the SEED= option, or if the SEED= value is negative or 0, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed.
Koen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can also do your own randomisation and gain greater control
proc sql;
create table numbers as
select store_number, spend_level, count(*) as n
from test
group by store_number, spend_level;
quit;
data rndControls;
call streaminit(78576);
set controls;
rnd = rand("UNIFORM");
run;
proc sort data=rndControls; by store_number spend_level rnd; run;
data matchControls;
merge numbers rndControls (in=inControls); by store_number spend_level;
if first.spend_level then count = 0;
if count <= n then do;
count + inControls;
if last.spend_level and count < n
then missing = n - count;
output;
end;
keep store_number spend_level clientId missing;
run;
proc print data=matchControls noobs; run;
the idea is to put the controls in random order within store_number and spend_level groups and then pick the first control clients that are required to match the test groups.