11-23-2015 04:47 PM
I have a client test group that differs based on store location and client spend group ( high, medium low).
I have a larger group of similar control clients created based on past history. I would like to select a random control client for each test client that shares the same store location and spend level.
example of tables:
I have a potential control group that looks like this but with many more clients, and not weighted the same by store and spend.
How do I create a control sample that has the same number of clients by store and spend.
So if Store 3 has 10 low spend, 20 high spend and 3 medium spend clients, and store 5 has 2 low spend, 1 high spend, and 8 medium spend clients, the control will have the same.
I'd like this sample to be random within these conditions.
thank you in advance!
11-23-2015 06:30 PM
You can do that with PROC SURVEYSELECT by specifying a secondary input data set (SAMPSIZE=).
PROC SURVEYSELECT data=input_SAS_dataset_ToSampleFrom /* control clients */
method=srs ; /* srs = Simple Random Sampling */
strata store_number spend_level;
You can provide stratum sample sizes in the
SampleSize) variable in the SAMPSIZE= data set.
The secondary input data set must contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the secondary data set as in the DATA= data set.
SAS/STAT(R) 14.1 User's Guide
The SURVEYSELECT Procedure
secondary input data set
You can get the required stratum sample sizes with a PROC FREQ on your client test group.
PROC FREQ; tables store_number * spend_level / out=count; run;
Good luck with this!
11-24-2015 04:36 AM
I forgot to say that, for the SURVEYSELECT solution, you should put a SEED= value that is a (arbitrary) strictly positive integer (>0). Otherwise your results are not reproducible.
From the online doc ...
specifies the initial seed for random number generation. The SEED= value must be a positive integer. If you do not specify the SEED= option, or if the SEED= value is negative or 0, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed.
11-23-2015 10:05 PM
You can also do your own randomisation and gain greater control
proc sql; create table numbers as select store_number, spend_level, count(*) as n from test group by store_number, spend_level; quit; data rndControls; call streaminit(78576); set controls; rnd = rand("UNIFORM"); run; proc sort data=rndControls; by store_number spend_level rnd; run; data matchControls; merge numbers rndControls (in=inControls); by store_number spend_level; if first.spend_level then count = 0; if count <= n then do; count + inControls; if last.spend_level and count < n then missing = n - count; output; end; keep store_number spend_level clientId missing; run; proc print data=matchControls noobs; run;
the idea is to put the controls in random order within store_number and spend_level groups and then pick the first control clients that are required to match the test groups.