Programming the statistical procedures from SAS

Complex Random Sampling

Occasional Contributor
Posts: 6

Complex Random Sampling

I have tried to search for an answer, but it's hard to know what term to search for!

I want to know how to get a random sample where I can specify specific attributes that must be represented in my sample.

The sample is for mailing proofs rather than actual control grouping.

An example:

From a mailing list, I want a sample that is independent of each factor, but I need groups of a specific size for each factor: for example, 1 example from each state, 10 examples of each payment frequency, 3 examples from each product type, etc.

I have about 20 attributes that need to be represented in the sample.

One option I am considering is consolidating the attributes so that there are fewer things to choose from, but I am really keen to see whether this can be done with the attributes as they are.

So far I have tried the random sampling tool in Enterprise Guide, but the strata function seems to want to choose all possible combinations of the factors I select.

I would appreciate suggestions or links to information.

Thanks for your time.

SAS Super FREQ
Posts: 3,306

Re: Complex Random Sampling

Have you tried to use PROC SURVEYSELECT? Here is the complete doc: SAS/STAT(R) 13.1 User's Guide

The STRATA statement with the ALLOC= option is very powerful.
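As a minimal sketch of what that looks like for a single attribute: the data set name mailing_list and the variable name state below are placeholders, not taken from the original post, and the seed is arbitrary. With one variable on the STRATA statement, N=1 requests one record per state.

```sas
/* Hypothetical names: mailing_list (input data set), state (attribute) */
proc sort data=mailing_list out=by_state;
    by state;                 /* input must be sorted by the STRATA variable */
run;

proc surveyselect data=by_state
                  method=srs  /* simple random sampling within each stratum */
                  n=1         /* one record per state */
                  seed=12345  /* arbitrary seed, for reproducibility */
                  out=state_sample;
    strata state;
run;
```

Listing several variables on STRATA makes every combination of their values a stratum. Adding ALLOC=PROP to the STRATA statement would instead treat N= as a total sample size and split it across the strata in proportion to their sizes.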

Occasional Contributor
Posts: 6

Re: Complex Random Sampling

Hi, thanks for the reply. I will read up some more on it. I was struggling with the fact that it wanted to sample every possible combination, rather than just making sure the sample included specific variables independently of the others (not sure if that makes sense).

Grand Advisor
Posts: 10,039

Re: Complex Random Sampling

If you want a SINGLE sample with 20 attributes, the STRATA statement does tend to look at all the combinations.

With 20 attributes, by which I assume you mean variables, you would be looking at a sample of at least 2**20 records (more than 1 million combinations), even if each variable had only two levels. Is that what you are looking for?

Since the generated sampling weights appear unimportant, if I understand correctly, you might want to make separate samples for subsets of your attributes, such as State and Payment frequency, or State and Product. If you have a unique identifier, ensure that variable is carried along in each sample. Then combine the identifiers into a unique list and use that list to select records from the full data. This is likely to generate more than a minimum set, but it should include at least some records from each required category.
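A rough sketch of that approach, simplified to one attribute per pass, might look like the following. All names (mailing_list, cust_id, state, pay_freq) are assumptions for illustration, as are the seeds. The SELECTALL option is included so that a category with fewer records than requested contributes all of its records instead of stopping the procedure.

```sas
/* Hypothetical names throughout: mailing_list, cust_id, state, pay_freq */

/* Sample 1: one record per state */
proc sort data=mailing_list out=by_state;
    by state;
run;

proc surveyselect data=by_state method=srs n=1 seed=101
                  selectall out=samp_state;
    strata state;
run;

/* Sample 2: ten records per payment frequency */
proc sort data=mailing_list out=by_freq;
    by pay_freq;
run;

proc surveyselect data=by_freq method=srs n=10 seed=102
                  selectall out=samp_freq;
    strata pay_freq;
run;

proc sql;
    /* UNION drops duplicates, leaving one row per selected identifier */
    create table picked_ids as
        select cust_id from samp_state
        union
        select cust_id from samp_freq;

    /* Pull the full records for the selected identifiers */
    create table proof_sample as
        select m.*
        from mailing_list as m
             inner join picked_ids as p
             on m.cust_id = p.cust_id;
quit;
```

Further attributes would each get their own sort and PROC SURVEYSELECT step, with their output added to the UNION. Because a record drawn for one attribute also counts toward the others, the combined list will usually be larger than strictly necessary.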

Respected Advisor
Posts: 4,606

Re: Complex Random Sampling

This problem can be formulated as a Constraint Satisfaction Problem. If you have access to SAS/OR, you could try something along the lines of this example:

/* Generate some example data: 20 IDs, each with 10 random binary properties */
data test;
    array prop{10};
    call streaminit(77651);
    do id = 1 to 20;
        do p = 1 to dim(prop);
            prop{p} = rand('uniform') > 0.8;
        end;
        output;
    end;
    drop p;
run;

/* Specify the minimum sample size for each property
   (2 for every property in this example) */
data size;
    do p = 1 to 10;
        _name_ = cats('prop', p);
        _RHS_ = 2;      /* Right-hand side of the linear constraint */
        _TYPE_ = "GE";  /* Sample size must be >= _RHS_ */
        output;
    end;
    drop p;
run;

/* Make the IDs into variables and the properties into observations */
proc transpose data=test out=constr prefix=id;
    id id;
run;

/* Create the linear constraints: the number of selected IDs having each
   property must be >= _RHS_ */
proc sql;
    create table lincon(drop=_name_) as
    select c.*, s._RHS_, s._TYPE_ length=3
    from constr as c inner join
         size as s on c._name_ = s._name_;
quit;

/* Add an objective function: minimize the total number of selected IDs */
data lincon;
    set lincon end=last;
    output;
    array id{*} id:;
    if last then do;
        do i = 1 to dim(id);
            id{i} = 1;
        end;
        call missing(_RHS_);
        _TYPE_ = "MIN";
        output;
    end;
    drop i;
run;

/* Cross your fingers and call PROC CLP. Specify that all ID variables are binary.
   Ask for 10 optimal solutions (possible samples of minimum size). */
proc CLP condata=lincon USECONDATAVARS=1 domain=[0,1] maxsolns=10 out=soln;
run;

proc print data=soln; run;

NOTE: Number of LINEAR constraints: 10.
NOTE: Total number of arrays: 0.
NOTE: Total number of variables: 20.
NOTE: Total number of constraints: 10.
NOTE: Required number of solutions found (10).
NOTE: Minimum objective value found: 6.
NOTE: There were 11 observations read from the data set WORK.LINCON.
NOTE: The data set WORK.SOLN has 10 observations and 20 variables.
NOTE: PROCEDURE CLP used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

Obs  id1  id2  id3  id4  id5  id6  id7  id8  id9 id10 id11 id12 id13 id14 id15 id16 id17 id18 id19 id20
  1    0    0    0    0    0    0    0    0    0    0    1    0    0    1    0    1    1    0    1    1
  2    0    0    0    0    0    0    0    0    0    0    1    0    0    1    1    1    0    0    1    1
  3    0    0    0    0    0    0    0    0    0    0    1    0    0    1    1    1    1    0    0    1
  4    0    0    0    0    0    0    0    0    0    0    1    0    1    1    1    0    0    0    1    1
  5    0    0    0    0    0    0    0    0    0    0    1    0    1    1    0    0    1    0    1    1
  6    0    0    0    0    0    0    1    0    0    0    1    0    0    1    1    1    0    0    0    1
  7    0    0    0    0    0    0    1    0    0    0    1    0    0    1    0    1    1    0    0    1
  8    0    0    0    0    0    0    1    0    0    0    1    0    0    1    0    1    0    0    1    1
  9    0    0    0    0    0    0    1    0    0    0    1    0    1    1    0    0    0    0    1    1
 10    0    0    0    0    0    0    0    0    0    1    1    0    0    1    1    0    0    0    1    1
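To turn one of those solutions into an actual list of records, something along these lines might work. It is only a sketch: it assumes the soln and test data sets created by the code above, and the names sol1, picked and sample1 are made up for illustration.

```sas
/* Turn the first solution (first row of soln) into one row per ID variable */
proc transpose data=soln(obs=1) out=sol1(rename=(col1=selected));
    var id:;
run;

/* Keep the IDs flagged 1 and recover the numeric id from the name "idNN" */
data picked;
    set sol1;
    where selected = 1;
    id = input(substr(_name_, 3), 8.);
    keep id;
run;

/* Retrieve the sampled records from the example data */
proc sql;
    create table sample1 as
        select t.*
        from test as t
             inner join picked as p
             on t.id = p.id;
quit;
```

Changing OBS=1 to FIRSTOBS=k OBS=k would pick the k-th solution instead. For the first row of the listing, the selected IDs are 11, 14, 16, 17, 19 and 20.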

PG
