Re: Sampling to meet reference characteristics - Page 2

leex1514 · Posted 05-28-2021 09:41 AM

I still have no 1s of var2 in samp. I don't know why PROC SURVEYSELECT does not pick both 0 and 1 in samp.

leex1514 · Posted 05-28-2021 09:49 AM

Here are the outputs (var2=w).

Obs	W	_alloc_
1	0	62.8524
2	1	37.1476

Obs	r	r0	r1
1	5000	3143	1857

Obs	Replicate	W	COUNT	PERCENT
1	1	0	3143	0.2
2	2	0	3143	0.2
3	3	0	3143	0.2
4	4	0	3143	0.2
5	5	0	3143	0.2
6	6	0	3143	0.2
7	7	0	3143	0.2
8	8	0	3143	0.2
9	9	0	3143	0.2
10	10	0	3143	0.2

FreelanceReinh · Posted 05-28-2021 10:06 AM

@leex1514 wrote:

Here are the outputs (var2=w).

So, the real name of your "var2" happens to be w -- which is the name I arbitrarily chose for the weight variable? This trivial name conflict might explain the nonsensical results. Can you replace the w from the code I provided by a different name, say, _w or whatever does not occur in your data elsewhere?
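
A quick way to check for such a clash is to list the variable names already present in the input data before choosing a name for the weight. This is only a sketch: have2 is a placeholder for whatever your input dataset is called.

```sas
/* List all variable names in the input dataset (have2 is a placeholder name). */
/* Any name that appears here should not be reused for the weight variable.   */
proc contents data=have2 short;
run;
```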

leex1514 · Posted 05-28-2021 10:38 AM

For that reason, I changed the weight w to w1 and left var2 as w. But I will rename var2 to something completely different to avoid confusion.

FreelanceReinh · Posted 05-28-2021 11:28 AM

Thanks. With similar, but different variable names for "VAR2" and the weight variable no name conflict should occur. If your check of the variable names does not solve the problem, the next step will be to examine the log of the PROC SURVEYSELECT step, i.e., the section which looks like this:

67   proc surveyselect data=have2_wgt rep=500
68   method=pps n=&r
69   seed=2718 out=samp;
70   size w;
71   strata var2 / alloc=targetprops;
72   run;

NOTE: 9 sampling units were omitted due to missing or nonpositive size measures.
NOTE: The above message was for the following stratum:
      var2=1.
NOTE: The data set WORK.SAMP has 500000 observations and 28 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
      real time           1.09 seconds
      cpu time            1.10 seconds

(It's important to use the "</>" (Insert Code) button to post the log in order to preserve formatting.)

As shown in the example, not only warnings and errors, but also notes in the log can indicate certain issues such as the omission of sampling units.
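
To see where a note like this comes from, one could count, per stratum, the units whose size measure is missing or nonpositive. A minimal sketch, assuming the dataset and variable names from the log above (have2_wgt, w, var2); the column names n_units and n_unusable are made up for illustration:

```sas
/* Count units per stratum and how many of them PROC SURVEYSELECT      */
/* would omit because the size measure w is missing or not positive.   */
proc sql;
  select var2,
         count(*)                   as n_units,
         sum(missing(w) or w <= 0)  as n_unusable
  from have2_wgt
  group by var2;
quit;
```

If n_unusable is large relative to n_units in a stratum, the sample drawn from that stratum cannot be expected to behave as intended.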

With rep=500 and &r=5000 the number of observations in WORK.SAMP should be 2500000. But your PROC FREQ output dataset already revealed that this is not the case: It appears that your WORK.SAMP has only about 1571500 (=500*3143) observations.
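
The arithmetic above can be checked directly on the output dataset. The following is a sketch under the assumption that WORK.SAMP contains the Replicate variable created by the REP= option and the stratum variable under the name var2 (substitute the actual name in your data):

```sas
/* Expected: 500 replicates x 5000 units = 2,500,000 observations in total. */
/* Per stratum, obs_per_rep should match the allocation (r0 and r1).        */
proc sql;
  select var2,
         count(*)                              as n_obs,
         count(distinct Replicate)             as n_reps,
         count(*) / count(distinct Replicate)  as obs_per_rep
  from samp
  group by var2;
quit;
```

With the allocation shown earlier in the thread (r0=3143, r1=1857), one would expect two rows with obs_per_rep of 3143 and 1857, respectively. A missing row for var2=1 would confirm that this stratum was never sampled.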

leex1514 · Posted 05-28-2021 02:18 PM

I found my error in var2. It occurred when I applied coefficients a and b. I fixed it and reran the whole analysis, but my var1's mean is still way higher than the target mean. Is this problem inherent in my data, so that nothing much can be done?

Dataset R4_502 /*HAVE1*/

The MEANS Procedure

Variable	N	Mean
readingscalescore1	228928	357.0903559
white	228989	0.3714764

Sample from R4_503 /*HAVE2*/

The MEANS Procedure

Variable	N	Mean
readingscalescore1	5000	387.1486000
white	5000	0.3714000

FreelanceReinh · Posted 05-28-2021 05:55 PM

@leex1514 wrote:

... still my var1's mean is way higher than the target mean. Is this problem inherent in my data, so that nothing much can be done?

Dataset R4_502 /*HAVE1*/

The MEANS Procedure

Variable	N	Mean
readingscalescore1	228928	357.0903559
white	228989	0.3714764

Sample from R4_503 /*HAVE2*/

The MEANS Procedure

Variable	N	Mean
readingscalescore1	5000	387.1486000
white	5000	0.3714000

I'd be surprised if the variability of your data was so large that no better match would be possible (with 500 replications). So, is the 387.1486 already the result of picking the mean closest to the target (357.09...)?

After checking the log of the PROC SURVEYSELECT step (especially the number of sampling units omitted, if any), I would look at the distribution of the 500 sample means:

proc means data=sampmeans;
var ms;
run;

Result for the test data:

                      Analysis Variable : ms

  N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------
500     133.9424040       1.7355790     124.3040000     136.2590000
-------------------------------------------------------------------

As you see, the mean is close to the target value (here: 133.9) and this should be the case with your data as well, unless the number of omitted sampling units is substantial.
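
For readers who do not have the earlier code at hand, here is one possible way to build such a dataset of replicate means and to pick the replicate closest to the target. It is a sketch, not necessarily the code used earlier in this thread: it assumes that WORK.SAMP contains Replicate and var1, that HAVE1 holds the reference data, and that the names target, diffs and abs_diff are free to use.

```sas
/* Mean of var1 within each replicate --> dataset SAMPMEANS, variable ms */
proc means data=samp noprint nway;
  class Replicate;
  var var1;
  output out=sampmeans(drop=_type_ _freq_) mean=ms;
run;

/* Target mean from the reference dataset, stored in macro variable TARGET */
proc sql noprint;
  select mean(var1) format=best32. into :target trimmed
  from have1;
quit;

/* Absolute distance of each replicate mean from the target */
data diffs;
  set sampmeans;
  abs_diff = abs(ms - &target);
run;

proc sort data=diffs;
  by abs_diff;
run;

/* First observation = replicate whose mean is closest to the target */
proc print data=diffs(obs=1);
run;
```

If even the closest replicate is far from the target, the problem most likely lies in the weights or in omitted sampling units rather than in the selection step.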

Sachin6816 · Posted 02-16-2023 06:34 AM

Hi! Thanks for this. It is very useful.
I actually have a similar problem. In my case both variables are categorical.
Could you please tweak your code for the case where I have to take a sample from one dataset to match the proportions of both variables in a different dataset?
So, if you imagine that both var1 and var2 are categorical in this case, what would be the correct code to choose the sample?

FreelanceReinh · Posted 02-16-2023 10:56 AM

Hi @Sachin6816,

I think the ALLOC= option of the STRATA statement applied to two stratum variables (var1, var2) is the solution in your case.

Here's a complete example including the creation of test datasets HAVE1, whose var1 and var2 proportions you want to match, and HAVE2, from which you want to draw the random sample (without replacement), stratified by var1 and var2:

/* Create example data for demonstration */

data have1 have2;
set sashelp.heart;
var1=byte(64+whichc(first(BP_Status),'H','N','O')); /* --> values 'A', 'B', 'C' */
var2=(status='Dead'); /* --> values 0, 1 */
if _n_<=2000 then output have1; /* 2000 obs. */
else output have2; /* 3209 obs. */
run;

proc sort data=have2;
by var1 var2;
run;

/* Determine target proportions of (VAR1, VAR2) combinations (A,0), (A,1), (B,0), (B,1), (C,0), (C,1) */

proc freq data=have1;
tables var1*var2 / out=targetprops(drop=count rename=(percent=_alloc_));
run;

/* Draw random sample of size n=1000 from HAVE2 using stratum allocation proportions from HAVE1 */

proc surveyselect data=have2
method=srs n=1000
seed=2718 out=want;
strata var1 var2 / alloc=targetprops;
run;

/* Compare frequency distribution of (VAR1, VAR2) combinations between dataset HAVE1 and the sample */

title 'Dataset HAVE1';
proc freq data=have1;
tables var1*var2;
run;

title 'Sample from HAVE2';
proc freq data=want;
tables var1*var2;
run;
title;

Hint: To get faster and potentially better replies it's recommended to start a new thread for a new question (even if similar to an old question, which you could always refer to with a link). Thus you reach the largest possible audience and not only those who look into old threads. Also, later readers would not be confused about what question is being answered.
