topic Re: Sampling to meet reference characteristics in SAS Programming

Sampling to meet reference characteristics

leex1514 — Thu, 20 May 2021 15:56:10 GMT

Hello SAS users,

I am trying to draw a sample from a dataset (data1) to meet a few reference points of the other data(data2). The reference points are 1) mean of variable 1 2) proportion of variable 2 (e.g. ,37% from 0/1 values). These variable 1 and 2 exist in data1. I would like to draw a sample from data 1 with mean of variable 1 and proportion of variable 2 that are same values as data 2. Is there a way?

Re: Sampling to meet reference characteristics

PaigeMiller — Thu, 20 May 2021 16:32:27 GMT

Random sampling? or otherwise?

Does the sample have to match EXACTLY the original data mean of variable 1 and match EXACTLY the percent of variable 2, or can it be "close"? If "close", how do you determine what is close?

Re: Sampling to meet reference characteristics

leex1514 — Thu, 20 May 2021 16:38:45 GMT

Hi Paige,

Random sampling is preferable but if I cannot meet the targets with random sampling, non-random sampling is ok. The targets need to be close (enough) within a standard error (of course the exact match is preferable).

thank you

Re: Sampling to meet reference characteristics

ballardw — Thu, 20 May 2021 17:30:07 GMT

How many records are there in your original data? How many records do you want in your sample?

If you only have 30 records to start with and want 5 in the sample it will likely be a bit difficult to get very close.

On the other hand if your data has 100,000 and you want 1000 in the sample you have a better chance.

Re: Sampling to meet reference characteristics

FreelanceReinh — Thu, 20 May 2021 17:59:23 GMT

Hello @leex1514 and welcome to the SAS Support Communities!

It's relatively easy to get the proportions of variable 2 in the sample as close as possible to the original proportions: Use ALLOC=PROP in the STRATA statement of PROC SURVEYSELECT.

With a sufficient number of random samples to choose from (if needed), the mean of variable 1 will be reasonably close enough to the original value, too.

Here's an example using SASHELP.HEART (featuring Systolic as variable 1 [var1] and Status='Dead' as variable 2 [var2]):

/* Create example data for demonstration */

data have;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
run;

proc sort data=have;
by var2;
run; /* 5209 obs. */

/* Draw 500 stratified random samples of size n=1000 */

proc surveyselect data=have rep=500
method=srs n=1000
seed=2718 out=samp(drop=total--samplingweight);
strata var2 / alloc=prop;
run; /* 500,000 obs. */

/* Compute sample means of VAR1 */

proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=m;
run;

/* Determine the REPLICATE number of the optimum sample */

proc sql noprint;
select replicate into :repno
from (select *, abs(m-(select mean(var1) from have)) as _d from sampmeans
      having _d=min(_d));
quit;

/* Select the optimum sample */

data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */

/* Compare means of VAR1 and VAR2 between the full dataset and the sample */

title 'Full dataset';
proc means data=have n mean;
var var1 var2;
run;

title 'Sample';
proc means data=want n mean;
var var1 var2;
run;
title;

Result:

Full dataset

Variable       N            Mean
--------------------------------
var1        5209     136.9095796
var2        5209       0.3822231
--------------------------------

Sample

Variable       N            Mean
--------------------------------
var1        1000     136.9100000
var2        1000       0.3820000
--------------------------------

(Note that the mean of var2 is the proportion of var2=1.)

Re: Sampling to meet reference characteristics

leex1514 — Thu, 20 May 2021 18:25:27 GMT

Thank you FreelanceReinhard. This looks very promising solution. One thing is that var1 and var2 are in data 1 and data 2. I need to sample from data 2 to meet the target of mean of var1 and proportion of var2 in the data1. Could you tweak your code to reflect that?

Re: Sampling to meet reference characteristics

leex1514 — Thu, 20 May 2021 18:56:08 GMT

Data 1 has N=99829 and data 2 N=129071. There are a lot of data points to sample from.

Re: Sampling to meet reference characteristics

FreelanceReinh — Thu, 20 May 2021 19:32:30 GMT

Here's an example that reflects your situation with two given datasets more closely, using arbitrary subsets HAVE1 and HAVE2 of SASHELP.HEART:

/* Create example data for demonstration */

data have1 have2;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
if ~mod(_n_,3) then output have1; /* 1736 obs. */
else output have2; /* 3473 obs. */
run; 

proc sort data=have2;
by var2;
run;

/* Determine target proportions of VAR2 values (0, 1) */

proc freq data=have1 noprint;
tables var2 / out=targetprops(keep=var2 percent rename=(percent=_alloc_));
run;

/* Draw 500 stratified random samples of size n=1000 from HAVE2 */

proc surveyselect data=have2 rep=500
method=srs n=1000
seed=2718 out=samp(drop=total--samplingweight);
strata var2 / alloc=targetprops;
run; /* 500,000 obs. */

/* Compute sample means of VAR1 */

proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=m;
run;

/* Determine the REPLICATE number of the optimum sample */

proc sql noprint;
select replicate into :repno
from (select *, abs(m-(select mean(var1) from have1)) as d from sampmeans
      having d=min(d));
quit;

/* Select the optimum sample */

data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */

/* Compare means of VAR1 and VAR2 between dataset HAVE1 and the sample */

title 'Dataset HAVE1';
proc means data=have1 n mean;
var var1 var2;
run;

title 'Sample from HAVE2';
proc means data=want n mean;
var var1 var2;
run;
title;

Result:

Dataset HAVE1

Variable       N            Mean
--------------------------------
var1        1736     136.8110599
var2        1736       0.3652074
--------------------------------

Sample from HAVE2

Variable       N            Mean
--------------------------------
var1        1000     136.8140000
var2        1000       0.3650000
--------------------------------

Caveat: If the means of var1 (and the frequency distributions of var2) differ too much between datasets HAVE1 and HAVE2, the "optimum" sample mean of var1 selected from 500 (or any feasible number of) samples from HAVE2 might not be "close enough" to the mean in HAVE1. In this situation a different sampling technique would need to be applied.

Re: Sampling to meet reference characteristics

leex1514 — Fri, 21 May 2021 16:55:51 GMT

The issue is that when we use SRS (simple random sampling) with replication, the mean of each replicated sample is very similar to the mean of data2. I wonder other sampling exist to mitigate this problem. In my original data, data 1 had mean of 357 and data2 had 350. All my replicated samples had close to 350ish means due to SRS. It met var2's proportion though. Perhaps 0/1 data is easier to meet the target than the continuous variable.

Re: Sampling to meet reference characteristics

FreelanceReinh — Fri, 21 May 2021 17:37:46 GMT

In this case it should be possible (under fairly mild assumptions) to replace the Laplace distribution governing the simple random sampling with a distribution such that (ideally) the expected value of the sample mean matches the target value, i.e., the mean of data1. This could be accomplished by assigning suitable weights to each element of data2 and using an appropriate sampling method. Obviously, there's a wide range of possible distributions, but some of them have likely undesirable properties (e.g., distributions with most of the probability mass concentrated to a narrow interval around the target value).

I will think about this over the weekend (CEST time zone) and hopefully find a satisfactory solution to this interesting problem. Maybe other forum experts who are more familiar with the non-default sampling methods of PROC SURVEYSELECT will also contribute their ideas.

Re: Sampling to meet reference characteristics

leex1514 — Fri, 21 May 2021 18:03:34 GMT

That sounds great! Thank you so much. I concerned about truncation of data distribution in the sample which does not resemble the data 1's properties (mean, std, and shape of distribution) when data 2 is somewhat different from data1. Also I wonder this new method can sample to follow closely to data 1's shape and remove some areas of data2 that have more access frequencies than data1.

Re: Sampling to meet reference characteristics

FreelanceReinh — Mon, 24 May 2021 09:46:29 GMT

What an interesting weekend that was! Many thanks, @leex1514, for starting this inspiring thread.

I've executed the plan that I outlined in my previous post.

Let's again use subsets of SASHELP.HEART for demonstration.

data have1 have2;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
if _n_<=2000 then output have1; /* 2000 obs. */
else output have2; /* 3209 obs. */
run; 

proc sort data=have2;
by var2;
run;

proc summary data=have1;
var var1;
output out=stats1(drop=_:) mean=m;
run; /* Mean 133.9 */

proc means data=have2;
var var1;
run; /* Mean 138.785... */

%let r=1000; /* intended size of the random sample from HAVE2 */

This is a case similar to what you described: Simple, stratified random sampling from HAVE2, even with thousands of replications, doesn't produce sample means close to the target value m=133.9. They are all too large, like >136.

So, how can sampling weights be defined in order to move the sample mean towards the target? For the time being I went for computational simplicity and set out to define weights as linear functions of VAR1, i.e., weights of the form a*VAR1+b with constants a and b. The simple random sampling applied so far is actually a special case of this: a=0, b>0 (arbitrary constant).

In a preliminary approach I disregarded the stratification in the definition of the weights. This simplifies the calculation of "optimum" coefficients a and b.

Set the weight for the target value m=133.9 to the arbitrary positive value 1, i.e., a*m+b=1.
The weight for a general VAR1 value is then a*VAR1+b = a*VAR1+1-a*m = 1+(VAR1-m)*a.
The requirement that weights must be positive imposes one of two restrictions on VAR1:
VAR1 > m-1/a if a>0,
VAR1 < m-1/a if a<0 (this applies to our example).
This is plausible when you imagine the impact of outliers (to be "weighted down") on a linear weighting scheme. Nonlinear (e.g., sigmoid-shaped), everywhere positive weight functions could avoid negative weights by definition. I will give those a try if negative weights abound with your real data. If there are only a few of them, I think it should be acceptable to simply exclude these observations.
The requirement that the weighted mean of the n VAR1 values of HAVE2 be equal to the target value m yields (by a simple calculation):
a=(n*m-sum(VAR1))/(n*m**2-2*sum(VAR1)*m+uss(VAR1)).

/* Compute the coefficients a and b and the limit for VAR1 */

proc summary data=have2;
var var1;
output out=stats2(drop=_:) n=n sum=s uss=q;
run;

data coeff;
set stats1;
set stats2;
a=(n*m-s)/(n*m**2-2*s*m+q);
b=1-a*m;
limit=m-1/a;
run; /* a=-0.007818..., b=2.04689..., limit=261.802... */

/* Compute the weights */

data have2_wgt;
if _n_=1 then set coeff(keep=a b);
set have2;
w=a*var1+b;
run;

Seven VAR1 values in HAVE2 get negative weights because they exceed the (here:) upper limit: max(VAR1)=300 > 261.802.

/* Check that the weighted mean of VAR1 equals the target value */

proc sql;
select sum(w*var1)/sum(w) as wmean
from have2_wgt;
quit; /* indeed: 133.9 */

proc means data=have2_wgt;
weight w;
var var1;
run; /* 133.964... -- the minor deviation is due to the replacement of negative weights with zero by PROC MEANS.
        It's small enough so that the sample replication technique can deal with it. Another option would be to
        re-adjust the coefficients a and b after exclusion of the observations with negative weights. */

/* Determine target proportions of VAR2 values (0, 1) */

proc freq data=have1 noprint;
tables var2 / out=targetprops(keep=var2 percent rename=(percent=_alloc_));
run;

/* Draw 500 stratified weighted random samples of size r=1000 from HAVE2_WGT */

proc surveyselect data=have2_wgt rep=500
method=pps n=&r
seed=2718 out=samp;
size w;
strata var2 / alloc=targetprops;
run; /* 500,000 obs. The 7 observations (0.2%) with negative weights are automatically excluded
        by PROC SURVEYSELECT as well. */

I used the basic PPS method to create the weighted random samples. There are other comparable methods available in PROC SURVEYSELECT which differ from PPS, e.g., in the joint probabilities of selection.

/* Compute sample means of VAR1 */

proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=ms;
run;

/* Determine the REPLICATE number of the optimum sample */

proc sql noprint;
select replicate into :repno
from (select *, abs(ms-(select m from stats1)) as d from sampmeans
      having d=min(d));
quit;

/* Select the optimum sample */

data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */

/* Compare means of VAR1 and VAR2 between dataset HAVE1 and the sample */

title 'Dataset HAVE1';
proc means data=have1 n mean;
var var1 var2;
run;

title 'Sample from HAVE2';
proc means data=want n mean;
var var1 var2;
run;
title;

Result:

Dataset HAVE1

Variable       N            Mean
--------------------------------
var1        2000     133.9000000
var2        2000       0.3685000
--------------------------------

Sample from HAVE2

Variable       N            Mean
--------------------------------
var1        1000     133.8930000
var2        1000       0.3680000
--------------------------------

For the sample data from SASHELP.HEART this already works quite well.

The next step is to take the stratification into account in the definition of the weights. I'm going to present that improved approach in a separate post.

Re: Sampling to meet reference characteristics

FreelanceReinh — Mon, 24 May 2021 15:05:43 GMT

Now for the improved approach that takes the stratification into account in the definition of the sampling weights.

/* Create sample data (same as before) */

data have1 have2;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
if _n_<=2000 then output have1; /* 2000 obs. */
else output have2; /* 3209 obs. */
run; 

proc sort data=have2;
by var2;
run;

proc summary data=have1;
var var1;
output out=stats1(drop=_:) mean=m;
run; /* Mean 133.9 */

%let r=1000; /* intended size of the random sample from HAVE2 */

/* Determine target proportions of VAR2 values (0, 1) */

proc freq data=have1 noprint;
tables var2 / out=targetprops(keep=var2 percent rename=(percent=_alloc_));
run;

/* Determine sample sizes r0, r1 for the two strata (VAR2=0, VAR2=1) */

data sample_sizes(keep=r:);
set targetprops;
retain r &r r0;
if var2=0 then r0=round(r*_alloc_/100);
if var2=1;
r1=r-r0;
run;

The computation of "optimum" coefficients a and b for the weights a*VAR1+b is now a bit more complicated in the case of a. While the first requirement a*m+b=1 is the same as before, we now want a weighted mean of weighted means (!) to be equal to m: (r0*m0 + r1*m1)/r = m, where m0 and m1 are the weighted means of the VAR1 values in the two strata (VAR2=0, VAR2=1) of HAVE2. This boils down to solving a quadratic equation in a, involving summary statistics of VAR1 in the strata.

/* Compute the coefficients a and b and the limit for VAR1 */

proc summary data=have2 nway;
where var2=0;
var var1;
output out=stats2_0(drop=_:) n=t0 sum=s0 uss=q0;
run;

proc summary data=have2 nway;
where var2=1;
var var1;
output out=stats2_1(drop=_:) n=t1 sum=s1 uss=q1;
run;

data coeff;
set stats1;
set sample_sizes;
set stats2_0;
set stats2_1;
d =  r0*(q0-m*s0)*(s1-m*t1)
   + r1*(q1-m*s1)*(s0-m*t0)
   -r*m*(s0-m*t0)*(s1-m*t1);
c=(  r0*(t1*(q0-m*s0) + s0*(s1-m*t1))
   + r1*(t0*(q1-m*s1) + s1*(s0-m*t0))
   -r*m*(t0*(s1-m*t1) + t1*(s0-m*t0))) / d;
d=(r0*s0*t1 + r1*s1*t0 - r*m*t0*t1)/d;
a=-c/2 + sqrt(c**2/4-d); /* Check the result to decide between +/- sqrt(...). */
b=1-a*m;
limit=m-1/a;
run; /* a=-0.007965..., b=2.06659..., limit=259.439... */

/* Compute the weights */

data have2_wgt;
if _n_=1 then set coeff(keep=a b);
set have2;
w=a*var1+b;
run;

This time 9 of the 3209 VAR1 values (0.3%) in HAVE2 get negative weights because they exceed the upper limit of 259.439.

/* Check that the weighted mean of weighted means of VAR1 equals the target value */

proc sql;
select ( r0*(select sum(w*var1)/sum(w)
             from have2_wgt
             where var2=0)
        +r1*(select sum(w*var1)/sum(w)
             from have2_wgt
             where var2=1)
       )/r
from sample_sizes;
quit; /* indeed: 133.9 */

/* Draw 500 stratified weighted random samples of size r=1000 from HAVE2_WGT */

proc surveyselect data=have2_wgt rep=500
method=pps n=&r
seed=2718 out=samp;
size w;
strata var2 / alloc=targetprops;
run; /* 500,000 obs. (9 obs. of HAVE2_WGT were excluded due to negative weights) */

/* Compute sample means of VAR1 */

proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=ms;
run;

/* Determine the REPLICATE number of the optimum sample */

proc sql noprint;
select replicate into :repno
from (select *, abs(ms-(select m from stats1)) as d from sampmeans
      having d=min(d));
quit;

/* Select the optimum sample */

data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */

/* Compare means of VAR1 and VAR2 between dataset HAVE1 and the sample */

title 'Dataset HAVE1';
proc means data=have1 n mean;
var var1 var2;
run;

title 'Sample from HAVE2';
proc means data=want n mean;
var var1 var2;
run;
title;

/* Compare shapes of distributions of VAR1 in HAVE1 and the sample */

data cmp(keep=source var1);
set have1 want(in=s);
if s then source='Sample from HAVE2';
else source='HAVE1';
run;

ods graphics on;

proc univariate data=cmp;
class source;
var var1;
histogram;
run;

Results:

Dataset HAVE1

Variable       N            Mean
--------------------------------
var1        2000     133.9000000
var2        2000       0.3685000
--------------------------------

Sample from HAVE2

Variable       N            Mean
--------------------------------
var1        1000     133.9000000
var2        1000       0.3680000
--------------------------------

So the refined approach performed even better on the SASHELP.HEART sample data.

Edit: Added comment to assignment statement a=....

Re: Sampling to meet reference characteristics

leex1514 — Thu, 27 May 2021 23:09:11 GMT

Thank you so much for your code! I have tried it but my final sample's means are not meeting targets (mean of var1 is way too high and mean of var2=0) . I found "samp" does not contain var2=1 (only contain var2=0). Maybe the proc surveyselect step is not working properly? Or my data won't work no matter what I do?

Re: Sampling to meet reference characteristics

FreelanceReinh — Fri, 28 May 2021 08:02:40 GMT

@leex1514 wrote:

Thank you so much for your code! I have tried it but my final sample's means are not meeting targets (mean of var1 is way too high and mean of var2=0) . I found "samp" does not contain var2=1 (only contain var2=0). Maybe the proc surveyselect step is not working properly? Or my data won't work no matter what I do?

@leex1514: I'm sorry and surprised to read this. Intuitively, I think the large deviations from the expected results suggest that a relatively small (!) change to your code, e.g., an adaptation to your data structure, might be sufficient to resolve the issue.

Let's start our investigation with the simple part, i.e. var2, and examine the datasets TARGETPROPS, SAMPLE_SIZES and SAMP created with the most recent approach.

proc print data=targetprops;
run;

proc print data=sample_sizes;
run;

proc freq data=samp noprint;
tables replicate*var2 / out=cnt;
run;

proc print data=cnt(obs=10);
run;

Using the SASHELP.HEART sample data, the three PROC PRINT steps above yield:

Obs    var2    _alloc_

 1       0      63.15
 2       1      36.85

Obs      r      r0     r1

 1     1000    632    368

 Obs    Replicate    var2    COUNT    PERCENT

   1        1          0      632      0.1264
   2        1          1      368      0.0736
   3        2          0      632      0.1264
   4        2          1      368      0.0736
   5        3          0      632      0.1264
   6        3          1      368      0.0736
   7        4          0      632      0.1264
   8        4          1      368      0.0736
   9        5          0      632      0.1264
  10        5          1      368      0.0736

The proportions of {var2=0} and {var2=1} are necessarily constant in the replicates in SAMP and identical (up to rounding) to the proportions found in dataset HAVE1 (stored in TARGETPROPS).

Please run these PROC PRINT steps on your data and post the results.

Re: Sampling to meet reference characteristics

leex1514 — Fri, 28 May 2021 13:41:22 GMT

I still have no 1s of var2 in samp. I don't know why proc survey does not pick 0 and 1 in the samp.

Re: Sampling to meet reference characteristics

leex1514 — Fri, 28 May 2021 13:49:28 GMT

Here are the outputs (var2=w).

Obs	W	_alloc_
1	0	62.8524
2	1	37.1476

Obs	r	r0	r1
1	5000	3143	1857

Obs	Replicate	W	COUNT	PERCENT
1	1	0	3143	0.2
2	2	0	3143	0.2
3	3	0	3143	0.2
4	4	0	3143	0.2
5	5	0	3143	0.2
6	6	0	3143	0.2
7	7	0	3143	0.2
8	8	0	3143	0.2
9	9	0	3143	0.2
10	10	0	3143	0.2

Re: Sampling to meet reference characteristics

FreelanceReinh — Fri, 28 May 2021 14:06:33 GMT

@leex1514 wrote:

Here are the outputs (var2=w).

So, the real name of your "var2" happens to be w -- which is the name I arbitrarily chose for the weight variable? This trivial name conflict might explain the nonsensical results. Can you replace the w from the code I provided by a different name, say, _w or whatever does not occur in your data elsewhere?

Re: Sampling to meet reference characteristics

leex1514 — Fri, 28 May 2021 14:38:28 GMT

For that reason, I changed weight w to w1 and leave var2 as w. But I will change the var2 name to be something completely different to avoid confusion.

Re: Sampling to meet reference characteristics

FreelanceReinh — Fri, 28 May 2021 15:28:49 GMT

Thanks. With similar, but different variable names for "VAR2" and the weight variable no name conflict should occur. If your check of the variable names does not solve the problem, the next step will be to examine the log of the PROC SURVEYSELECT step, i.e., the section which looks like this:

67   proc surveyselect data=have2_wgt rep=500
68   method=pps n=&r
69   seed=2718 out=samp;
70   size w;
71   strata var2 / alloc=targetprops;
72   run;

NOTE: 9 sampling units were omitted due to missing or nonpositive size measures.
NOTE: The above message was for the following stratum:
      var2=1.
NOTE: The data set WORK.SAMP has 500000 observations and 28 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
      real time           1.09 seconds
      cpu time            1.10 seconds

(It's important to use the "</>" (Insert Code) button to post the log in order to preserve formatting.)

As shown in the example, not only warnings and errors, but also notes in the log can indicate certain issues such as the omission of sampling units.

With rep=500 and &r=5000 the number of observations in WORK.SAMP should be 2500000. But your PROC FREQ output dataset already revealed that this is not the case: It appears that your WORK.SAMP has only about 1571500 (=500*3143) observations.