Hello SAS users,
I am trying to draw a sample from a dataset (data1) to meet a few reference points of the other data(data2). The reference points are 1) mean of variable 1 2) proportion of variable 2 (e.g. ,37% from 0/1 values). These variable 1 and 2 exist in data1. I would like to draw a sample from data 1 with mean of variable 1 and proportion of variable 2 that are same values as data 2. Is there a way?
Random sampling? or otherwise?
Does the sample have to match EXACTLY the original data mean of variable 1 and match EXACTLY the percent of variable 2, or can it be "close"? If "close", how do you determine what is close?
Hi Paige,
Random sampling is preferable but if I cannot meet the targets with random sampling, non-random sampling is ok. The targets need to be close (enough) within a standard error (of course the exact match is preferable).
thank you
How many records are there in your original data? How many records do you want in your sample?
If you only have 30 records to start with and want 5 in the sample it will likely be a bit difficult to get very close.
On the other hand if your data has 100,000 and you want 1000 in the sample you have a better chance.
Data 1 has N=99829 and data 2 N=129071. There are a lot of data points to sample from.
Hello @leex1514 and welcome to the SAS Support Communities!
It's relatively easy to get the proportions of variable 2 in the sample as close as possible to the original proportions: Use ALLOC=PROP in the STRATA statement of PROC SURVEYSELECT.
With a sufficient number of random samples to choose from (if needed), the mean of variable 1 will be reasonably close enough to the original value, too.
Here's an example using SASHELP.HEART (featuring Systolic as variable 1 [var1] and Status='Dead' as variable 2 [var2]):
/* Create example data for demonstration */
data have;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
run;
proc sort data=have;
by var2;
run; /* 5209 obs. */
/* Draw 500 stratified random samples of size n=1000 */
proc surveyselect data=have rep=500
method=srs n=1000
seed=2718 out=samp(drop=total--samplingweight);
strata var2 / alloc=prop;
run; /* 500,000 obs. */
/* Compute sample means of VAR1 */
proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=m;
run;
/* Determine the REPLICATE number of the optimum sample */
proc sql noprint;
select replicate into :repno
from (select *, abs(m-(select mean(var1) from have)) as _d from sampmeans
having _d=min(_d));
quit;
/* Select the optimum sample */
data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */
/* Compare means of VAR1 and VAR2 between the full dataset and the sample */
title 'Full dataset';
proc means data=have n mean;
var var1 var2;
run;
title 'Sample';
proc means data=want n mean;
var var1 var2;
run;
title;
Result:
Full dataset Variable N Mean -------------------------------- var1 5209 136.9095796 var2 5209 0.3822231 -------------------------------- Sample Variable N Mean -------------------------------- var1 1000 136.9100000 var2 1000 0.3820000 --------------------------------
(Note that the mean of var2 is the proportion of var2=1.)
Thank you FreelanceReinhard. This looks very promising solution. One thing is that var1 and var2 are in data 1 and data 2. I need to sample from data 2 to meet the target of mean of var1 and proportion of var2 in the data1. Could you tweak your code to reflect that?
Here's an example that reflects your situation with two given datasets more closely, using arbitrary subsets HAVE1 and HAVE2 of SASHELP.HEART:
/* Create example data for demonstration */
data have1 have2;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
if ~mod(_n_,3) then output have1; /* 1736 obs. */
else output have2; /* 3473 obs. */
run;
proc sort data=have2;
by var2;
run;
/* Determine target proportions of VAR2 values (0, 1) */
proc freq data=have1 noprint;
tables var2 / out=targetprops(keep=var2 percent rename=(percent=_alloc_));
run;
/* Draw 500 stratified random samples of size n=1000 from HAVE2 */
proc surveyselect data=have2 rep=500
method=srs n=1000
seed=2718 out=samp(drop=total--samplingweight);
strata var2 / alloc=targetprops;
run; /* 500,000 obs. */
/* Compute sample means of VAR1 */
proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=m;
run;
/* Determine the REPLICATE number of the optimum sample */
proc sql noprint;
select replicate into :repno
from (select *, abs(m-(select mean(var1) from have1)) as d from sampmeans
having d=min(d));
quit;
/* Select the optimum sample */
data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */
/* Compare means of VAR1 and VAR2 between dataset HAVE1 and the sample */
title 'Dataset HAVE1';
proc means data=have1 n mean;
var var1 var2;
run;
title 'Sample from HAVE2';
proc means data=want n mean;
var var1 var2;
run;
title;
Result:
Dataset HAVE1 Variable N Mean -------------------------------- var1 1736 136.8110599 var2 1736 0.3652074 -------------------------------- Sample from HAVE2 Variable N Mean -------------------------------- var1 1000 136.8140000 var2 1000 0.3650000 --------------------------------
Caveat: If the means of var1 (and the frequency distributions of var2) differ too much between datasets HAVE1 and HAVE2, the "optimum" sample mean of var1 selected from 500 (or any feasible number of) samples from HAVE2 might not be "close enough" to the mean in HAVE1. In this situation a different sampling technique would need to be applied.
The issue is that when we use SRS (simple random sampling) with replication, the mean of each replicated sample is very similar to the mean of data2. I wonder other sampling exist to mitigate this problem. In my original data, data 1 had mean of 357 and data2 had 350. All my replicated samples had close to 350ish means due to SRS. It met var2's proportion though. Perhaps 0/1 data is easier to meet the target than the continuous variable.
In this case it should be possible (under fairly mild assumptions) to replace the Laplace distribution governing the simple random sampling with a distribution such that (ideally) the expected value of the sample mean matches the target value, i.e., the mean of data1. This could be accomplished by assigning suitable weights to each element of data2 and using an appropriate sampling method. Obviously, there's a wide range of possible distributions, but some of them have likely undesirable properties (e.g., distributions with most of the probability mass concentrated to a narrow interval around the target value).
I will think about this over the weekend (CEST time zone) and hopefully find a satisfactory solution to this interesting problem. Maybe other forum experts who are more familiar with the non-default sampling methods of PROC SURVEYSELECT will also contribute their ideas.
That sounds great! Thank you so much. I concerned about truncation of data distribution in the sample which does not resemble the data 1's properties (mean, std, and shape of distribution) when data 2 is somewhat different from data1. Also I wonder this new method can sample to follow closely to data 1's shape and remove some areas of data2 that have more access frequencies than data1.
What an interesting weekend that was! Many thanks, @leex1514, for starting this inspiring thread.
I've executed the plan that I outlined in my previous post.
Let's again use subsets of SASHELP.HEART for demonstration.
data have1 have2;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
if _n_<=2000 then output have1; /* 2000 obs. */
else output have2; /* 3209 obs. */
run;
proc sort data=have2;
by var2;
run;
proc summary data=have1;
var var1;
output out=stats1(drop=_:) mean=m;
run; /* Mean 133.9 */
proc means data=have2;
var var1;
run; /* Mean 138.785... */
%let r=1000; /* intended size of the random sample from HAVE2 */
This is a case similar to what you described: Simple, stratified random sampling from HAVE2, even with thousands of replications, doesn't produce sample means close to the target value m=133.9. They are all too large, like >136.
So, how can sampling weights be defined in order to move the sample mean towards the target? For the time being I went for computational simplicity and set out to define weights as linear functions of VAR1, i.e., weights of the form a*VAR1+b with constants a and b. The simple random sampling applied so far is actually a special case of this: a=0, b>0 (arbitrary constant).
In a preliminary approach I disregarded the stratification in the definition of the weights. This simplifies the calculation of "optimum" coefficients a and b.
/* Compute the coefficients a and b and the limit for VAR1 */
proc summary data=have2;
var var1;
output out=stats2(drop=_:) n=n sum=s uss=q;
run;
data coeff;
set stats1;
set stats2;
a=(n*m-s)/(n*m**2-2*s*m+q);
b=1-a*m;
limit=m-1/a;
run; /* a=-0.007818..., b=2.04689..., limit=261.802... */
/* Compute the weights */
data have2_wgt;
if _n_=1 then set coeff(keep=a b);
set have2;
w=a*var1+b;
run;
Seven VAR1 values in HAVE2 get negative weights because they exceed the (here:) upper limit: max(VAR1)=300 > 261.802.
/* Check that the weighted mean of VAR1 equals the target value */
proc sql;
select sum(w*var1)/sum(w) as wmean
from have2_wgt;
quit; /* indeed: 133.9 */
proc means data=have2_wgt;
weight w;
var var1;
run; /* 133.964... -- the minor deviation is due to the replacement of negative weights with zero by PROC MEANS.
It's small enough so that the sample replication technique can deal with it. Another option would be to
re-adjust the coefficients a and b after exclusion of the observations with negative weights. */
/* Determine target proportions of VAR2 values (0, 1) */
proc freq data=have1 noprint;
tables var2 / out=targetprops(keep=var2 percent rename=(percent=_alloc_));
run;
/* Draw 500 stratified weighted random samples of size r=1000 from HAVE2_WGT */
proc surveyselect data=have2_wgt rep=500
method=pps n=&r
seed=2718 out=samp;
size w;
strata var2 / alloc=targetprops;
run; /* 500,000 obs. The 7 observations (0.2%) with negative weights are automatically excluded
by PROC SURVEYSELECT as well. */
I used the basic PPS method to create the weighted random samples. There are other comparable methods available in PROC SURVEYSELECT which differ from PPS, e.g., in the joint probabilities of selection.
/* Compute sample means of VAR1 */
proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=ms;
run;
/* Determine the REPLICATE number of the optimum sample */
proc sql noprint;
select replicate into :repno
from (select *, abs(ms-(select m from stats1)) as d from sampmeans
having d=min(d));
quit;
/* Select the optimum sample */
data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */
/* Compare means of VAR1 and VAR2 between dataset HAVE1 and the sample */
title 'Dataset HAVE1';
proc means data=have1 n mean;
var var1 var2;
run;
title 'Sample from HAVE2';
proc means data=want n mean;
var var1 var2;
run;
title;
Result:
Dataset HAVE1 Variable N Mean -------------------------------- var1 2000 133.9000000 var2 2000 0.3685000 -------------------------------- Sample from HAVE2 Variable N Mean -------------------------------- var1 1000 133.8930000 var2 1000 0.3680000 --------------------------------
For the sample data from SASHELP.HEART this already works quite well.
The next step is to take the stratification into account in the definition of the weights. I'm going to present that improved approach in a separate post.
Now for the improved approach that takes the stratification into account in the definition of the sampling weights.
/* Create sample data (same as before) */
data have1 have2;
set sashelp.heart(rename=(systolic=var1));
var2=(status='Dead');
if _n_<=2000 then output have1; /* 2000 obs. */
else output have2; /* 3209 obs. */
run;
proc sort data=have2;
by var2;
run;
proc summary data=have1;
var var1;
output out=stats1(drop=_:) mean=m;
run; /* Mean 133.9 */
%let r=1000; /* intended size of the random sample from HAVE2 */
/* Determine target proportions of VAR2 values (0, 1) */
proc freq data=have1 noprint;
tables var2 / out=targetprops(keep=var2 percent rename=(percent=_alloc_));
run;
/* Determine sample sizes r0, r1 for the two strata (VAR2=0, VAR2=1) */
data sample_sizes(keep=r:);
set targetprops;
retain r &r r0;
if var2=0 then r0=round(r*_alloc_/100);
if var2=1;
r1=r-r0;
run;
The computation of "optimum" coefficients a and b for the weights a*VAR1+b is now a bit more complicated in the case of a. While the first requirement a*m+b=1 is the same as before, we now want a weighted mean of weighted means (!) to be equal to m: (r0*m0 + r1*m1)/r = m, where m0 and m1 are the weighted means of the VAR1 values in the two strata (VAR2=0, VAR2=1) of HAVE2. This boils down to solving a quadratic equation in a, involving summary statistics of VAR1 in the strata.
/* Compute the coefficients a and b and the limit for VAR1 */
proc summary data=have2 nway;
where var2=0;
var var1;
output out=stats2_0(drop=_:) n=t0 sum=s0 uss=q0;
run;
proc summary data=have2 nway;
where var2=1;
var var1;
output out=stats2_1(drop=_:) n=t1 sum=s1 uss=q1;
run;
data coeff;
set stats1;
set sample_sizes;
set stats2_0;
set stats2_1;
d = r0*(q0-m*s0)*(s1-m*t1)
+ r1*(q1-m*s1)*(s0-m*t0)
-r*m*(s0-m*t0)*(s1-m*t1);
c=( r0*(t1*(q0-m*s0) + s0*(s1-m*t1))
+ r1*(t0*(q1-m*s1) + s1*(s0-m*t0))
-r*m*(t0*(s1-m*t1) + t1*(s0-m*t0))) / d;
d=(r0*s0*t1 + r1*s1*t0 - r*m*t0*t1)/d;
a=-c/2 + sqrt(c**2/4-d); /* Check the result to decide between +/- sqrt(...). */
b=1-a*m;
limit=m-1/a;
run; /* a=-0.007965..., b=2.06659..., limit=259.439... */
/* Compute the weights */
data have2_wgt;
if _n_=1 then set coeff(keep=a b);
set have2;
w=a*var1+b;
run;
This time 9 of the 3209 VAR1 values (0.3%) in HAVE2 get negative weights because they exceed the upper limit of 259.439.
/* Check that the weighted mean of weighted means of VAR1 equals the target value */
proc sql;
select ( r0*(select sum(w*var1)/sum(w)
from have2_wgt
where var2=0)
+r1*(select sum(w*var1)/sum(w)
from have2_wgt
where var2=1)
)/r
from sample_sizes;
quit; /* indeed: 133.9 */
/* Draw 500 stratified weighted random samples of size r=1000 from HAVE2_WGT */
proc surveyselect data=have2_wgt rep=500
method=pps n=&r
seed=2718 out=samp;
size w;
strata var2 / alloc=targetprops;
run; /* 500,000 obs. (9 obs. of HAVE2_WGT were excluded due to negative weights) */
/* Compute sample means of VAR1 */
proc summary data=samp nway;
class replicate;
var var1;
output out=sampmeans(drop=_:) mean=ms;
run;
/* Determine the REPLICATE number of the optimum sample */
proc sql noprint;
select replicate into :repno
from (select *, abs(ms-(select m from stats1)) as d from sampmeans
having d=min(d));
quit;
/* Select the optimum sample */
data want(drop=replicate);
set samp;
where replicate=&repno;
run; /* 1000 obs. */
/* Compare means of VAR1 and VAR2 between dataset HAVE1 and the sample */
title 'Dataset HAVE1';
proc means data=have1 n mean;
var var1 var2;
run;
title 'Sample from HAVE2';
proc means data=want n mean;
var var1 var2;
run;
title;
/* Compare shapes of distributions of VAR1 in HAVE1 and the sample */
data cmp(keep=source var1);
set have1 want(in=s);
if s then source='Sample from HAVE2';
else source='HAVE1';
run;
ods graphics on;
proc univariate data=cmp;
class source;
var var1;
histogram;
run;
Results:
Dataset HAVE1 Variable N Mean -------------------------------- var1 2000 133.9000000 var2 2000 0.3685000 -------------------------------- Sample from HAVE2 Variable N Mean -------------------------------- var1 1000 133.9000000 var2 1000 0.3680000 --------------------------------
So the refined approach performed even better on the SASHELP.HEART sample data.
Edit: Added comment to assignment statement a=....
Thank you so much for your code! I have tried it but my final sample's means are not meeting targets (mean of var1 is way too high and mean of var2=0) . I found "samp" does not contain var2=1 (only contain var2=0). Maybe the proc surveyselect step is not working properly? Or my data won't work no matter what I do?
@leex1514 wrote:
Thank you so much for your code! I have tried it but my final sample's means are not meeting targets (mean of var1 is way too high and mean of var2=0) . I found "samp" does not contain var2=1 (only contain var2=0). Maybe the proc surveyselect step is not working properly? Or my data won't work no matter what I do?
@leex1514: I'm sorry and surprised to read this. Intuitively, I think the large deviations from the expected results suggest that a relatively small (!) change to your code, e.g., an adaptation to your data structure, might be sufficient to resolve the issue.
Let's start our investigation with the simple part, i.e. var2, and examine the datasets TARGETPROPS, SAMPLE_SIZES and SAMP created with the most recent approach.
proc print data=targetprops;
run;
proc print data=sample_sizes;
run;
proc freq data=samp noprint;
tables replicate*var2 / out=cnt;
run;
proc print data=cnt(obs=10);
run;
Using the SASHELP.HEART sample data, the three PROC PRINT steps above yield:
Obs var2 _alloc_ 1 0 63.15 2 1 36.85
Obs r r0 r1 1 1000 632 368
Obs Replicate var2 COUNT PERCENT 1 1 0 632 0.1264 2 1 1 368 0.0736 3 2 0 632 0.1264 4 2 1 368 0.0736 5 3 0 632 0.1264 6 3 1 368 0.0736 7 4 0 632 0.1264 8 4 1 368 0.0736 9 5 0 632 0.1264 10 5 1 368 0.0736
The proportions of {var2=0} and {var2=1} are necessarily constant in the replicates in SAMP and identical (up to rounding) to the proportions found in dataset HAVE1 (stored in TARGETPROPS).
Please run these PROC PRINT steps on your data and post the results.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.