popples123
Calcite | Level 5

I have tried to search for an answer, but it's hard to know what term to search for.

I want to know how to draw a random sample in which specific attributes are guaranteed to be represented.

The sample is for mailing proofs rather than actual control grouping.

An example:

From a mailing list, I want a sample in which each factor is treated independently of the others, but with a specific group size for each factor: for example, 1 example from each state, 10 examples of each payment frequency, 3 examples of each product type, and so on.

I have about 20 attributes that need to be represented in the sample.

One option I am considering is consolidating the attributes so that there are fewer things to choose from, but I am really keen to see if I could do this without consolidating.

So far I have tried the random sampling tool in Enterprise Guide, but the strata option seems to want to sample from every possible combination of the factors I choose.

I would appreciate any suggestions or links to information.

Thanks for your time.

4 REPLIES 4
Rick_SAS
SAS Super FREQ

Have you tried to use PROC SURVEYSELECT? Here is the complete doc: SAS/STAT(R) 13.1 User's Guide

The STRATA statement with the ALLOC= option is very powerful.
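As a minimal sketch of what a STRATA-based draw looks like for a single attribute, the step below takes a fixed number of records from each level of one variable. SASHELP.CARS and its Origin variable are stand-ins for the mailing list and an attribute such as state; the output names and the seed are arbitrary assumptions. It uses a fixed N= per stratum rather than the ALLOC= option.

```sas
/* Stand-in data: SASHELP.CARS plays the role of the mailing list,
   and Origin plays the role of a single attribute such as state. */
proc sort data=sashelp.cars out=cars_sorted;
    by origin;                 /* input must be sorted by the STRATA variable */
run;

/* Simple random sample of 3 records from each level of Origin */
proc surveyselect data=cars_sorted out=origin_sample
                  method=srs n=3 seed=12345;
    strata origin;
run;

/* Check that every level of Origin appears 3 times */
proc freq data=origin_sample;
    tables origin;
run;
```

Listing several variables in one STRATA statement defines strata as their combinations, which is the behaviour described in the question.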

popples123
Calcite | Level 5

Hi, thanks for the reply. I will read up some more on it. I was struggling with the fact that it wanted to sample every possible combination of the variables, rather than just making sure each variable's values were represented independently of the others (not sure if that makes sense).

ballardw
Super User

If you want a SINGLE sample with 20 attributes, the STRATA statement does tend to look at all the combinations.

With 20 attributes, which I assume means variables, each with at least two levels, you would be looking at 2**20 or more combinations (more than 1 million), and so a sample of at least that many records. Is that what you are looking for?

Since the generated sampling weights appear unimportant, if I understand correctly, you might want to draw separate samples for subsets of your attributes, such as State and Payment frequency, or State and Product. If you have a unique identifier, ensure that variable comes along. Then combine the identifiers to make a unique list and use that to select records from the full data. This is likely to generate more than a minimum set but should include at least some records from each required category.
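The approach described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the data set MAILING, its variables ID, STATE and PAYFREQ, the group sizes, and the seeds. For brevity the sketch samples one attribute at a time rather than pairs of attributes.

```sas
/* Hypothetical mailing list: 200 records, 4 states, 3 payment frequencies */
data mailing;
    length state $2 payfreq $9;
    array st{4} $2 _temporary_ ('CA' 'NY' 'TX' 'WA');
    array pf{3} $9 _temporary_ ('Monthly' 'Quarterly' 'Annual');
    do id = 1 to 200;
        state   = st{mod(id, 4) + 1};
        payfreq = pf{mod(id, 3) + 1};
        output;
    end;
run;

/* One separate sample per attribute: 1 per state, 10 per payment frequency */
proc sort data=mailing out=by_state; by state; run;
proc surveyselect data=by_state out=s_state method=srs n=1 seed=101 noprint;
    strata state;
run;

proc sort data=mailing out=by_freq; by payfreq; run;
proc surveyselect data=by_freq out=s_freq method=srs n=10 seed=102 noprint;
    strata payfreq;
run;

proc sql;
    /* UNION removes duplicate IDs picked by more than one sample */
    create table picked as
    select id from s_state
    union
    select id from s_freq;

    /* Pull the full records for the selected IDs */
    create table proof_sample as
    select m.*
    from mailing as m inner join picked as p
        on m.id = p.id;
quit;
```

Because the samples are drawn independently, a record chosen for its state also counts toward its payment frequency, so some categories can end up with more than the requested minimum.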

PGStats
Opal | Level 21

This problem can be formulated as a Constraint Satisfaction Problem. If you have access to SAS/OR, you could try something along the lines of this example:

/* Generate some example data: 20 IDs have 10 random properties */
data test;
array prop{10};
call streaminit(77651);
do id = 1 to 20;
    do p = 1 to dim(Prop);
        prop{p} = rand('uniform') > 0.8;
        end;
    output;
    end;
drop p;
run;

/* Specify the minimum sampling sizes for each property,
  = 2 for all properties in this example */
data size;
do p = 1 to 10;
    _name_ = cats('prop', p);
    _RHS_ = 2;      /* Right hand side of the linear constraint */
    _TYPE_ = "GE";  /* Sample size must be >= _RHS_ */
    output;
    end;
drop p;
run;

/* Make the IDs into variables and the properties into observations */
proc transpose data=test out=constr prefix=id;
id id;
run;

/* Create linear constraints : The sum of selected IDs having each
property must be >= _RHS_ */
proc sql;
create table lincon(drop=_name_) as
select c.*, s._RHS_, s._TYPE_ length=3
from constr as c inner join
    size as s on c._name_=s._name_;
quit;

/* Add an objective function : minimize the total number of selected IDs */
data lincon;
set lincon end=last;
output;
array id{*} id:;
if last then do;
    do i = 1 to dim(id);
        id{i} = 1;
        end;
    call missing(_RHS_);
    _TYPE_ = "MIN";
    output;
    end;
drop i;
run;

/* Cross your fingers and call proc CLP. Specify that all ID variables are binary.
Ask for 10 optimal solutions (possible samples of minimum size). */
proc CLP condata=lincon USECONDATAVARS=1 domain=[0,1] maxsolns=10 out=soln;
run;

proc print data=soln; run;

NOTE: Number of LINEAR constraints: 10.
NOTE: Total number of arrays: 0.
NOTE: Total number of variables: 20.
NOTE: Total number of constraints: 10.
NOTE: Required number of solutions found (10).
NOTE: Minimum objective value found: 6.
NOTE: There were 11 observations read from the data set WORK.LINCON.
NOTE: The data set WORK.SOLN has 10 observations and 20 variables.
NOTE: PROCEDURE CLP used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

Obs  id1  id2  id3  id4  id5  id6  id7  id8  id9 id10 id11 id12 id13 id14 id15 id16 id17 id18 id19 id20
  1    0    0    0    0    0    0    0    0    0    0    1    0    0    1    0    1    1    0    1    1
  2    0    0    0    0    0    0    0    0    0    0    1    0    0    1    1    1    0    0    1    1
  3    0    0    0    0    0    0    0    0    0    0    1    0    0    1    1    1    1    0    0    1
  4    0    0    0    0    0    0    0    0    0    0    1    0    1    1    1    0    0    0    1    1
  5    0    0    0    0    0    0    0    0    0    0    1    0    1    1    0    0    1    0    1    1
  6    0    0    0    0    0    0    1    0    0    0    1    0    0    1    1    1    0    0    0    1
  7    0    0    0    0    0    0    1    0    0    0    1    0    0    1    0    1    1    0    0    1
  8    0    0    0    0    0    0    1    0    0    0    1    0    0    1    0    1    0    0    1    1
  9    0    0    0    0    0    0    1    0    0    0    1    0    1    1    0    0    0    0    1    1
 10    0    0    0    0    0    0    0    0    0    1    1    0    0    1    1    0    0    0    1    1
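Each row of the solution table is one candidate sample, with a 1 under every selected ID. One possible way to turn a row into a list of IDs is sketched below. To keep the step runnable on its own it uses a small stand-in data set, SOLN_DEMO, with made-up values for only 5 IDs; the names SOLN_DEMO, LONG, PICKED and SAMPLE_ID are hypothetical. With the real output you would read soln(obs=1) instead.

```sas
/* Stand-in for one row of WORK.SOLN (hypothetical values, 5 IDs only) */
data soln_demo;
    id1 = 0; id2 = 1; id3 = 0; id4 = 1; id5 = 1;
run;

/* One observation per ID variable; COL1 holds the 0/1 selection flag */
proc transpose data=soln_demo(obs=1) out=long(rename=(col1=selected));
    var id:;
run;

/* Keep the selected IDs and recover the numeric ID from the variable name */
data picked;
    set long;
    where selected = 1;
    sample_id = input(substr(_name_, 3), 8.);   /* 'id4' -> 4 */
    keep sample_id;
run;

proc print data=picked; run;
```

The resulting SAMPLE_ID values can then be joined back to the original data on its ID variable to pull the full records.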

PG
