leex1514
Calcite | Level 5

I still have no 1s of var2 in samp. I don't know why PROC SURVEYSELECT does not pick both 0 and 1 for samp.

leex1514
Calcite | Level 5

Here are the outputs (var2=w).

 

Obs    W    _alloc_
  1    0    62.8524
  2    1    37.1476

 

 

Obs       r      r0      r1
  1    5000    3143    1857

 

 

Obs    Replicate    W    COUNT    PERCENT
  1            1    0     3143        0.2
  2            2    0     3143        0.2
  3            3    0     3143        0.2
  4            4    0     3143        0.2
  5            5    0     3143        0.2
  6            6    0     3143        0.2
  7            7    0     3143        0.2
  8            8    0     3143        0.2
  9            9    0     3143        0.2
 10           10    0     3143        0.2

 

FreelanceReinh
Jade | Level 19

@leex1514 wrote:

Here are the outputs (var2=w).

So, the real name of your "var2" happens to be w -- which is the name I arbitrarily chose for the weight variable? This trivial name conflict might explain the nonsensical results. Can you replace the w from the code I provided by a different name, say, _w or whatever does not occur in your data elsewhere?

leex1514
Calcite | Level 5

For that reason, I changed the weight variable w to w1 and left var2 as w. But I will rename var2 to something completely different to avoid confusion.

FreelanceReinh
Jade | Level 19

Thanks. With similar, but different variable names for "VAR2" and the weight variable no name conflict should occur. If your check of the variable names does not solve the problem, the next step will be to examine the log of the PROC SURVEYSELECT step, i.e., the section which looks like this:

67   proc surveyselect data=have2_wgt rep=500
68   method=pps n=&r
69   seed=2718 out=samp;
70   size w;
71   strata var2 / alloc=targetprops;
72   run;

NOTE: 9 sampling units were omitted due to missing or nonpositive size measures.
NOTE: The above message was for the following stratum:
      var2=1.
NOTE: The data set WORK.SAMP has 500000 observations and 28 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
      real time           1.09 seconds
      cpu time            1.10 seconds

(It's important to use the "</>" (Insert Code) button to post the log in order to preserve formatting.)

 

As shown in the example, not only warnings and errors, but also notes in the log can indicate certain issues such as the omission of sampling units.
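The note about omitted sampling units can be cross-checked before sampling. The following is only a sketch, assuming the dataset and variable names from the log excerpt above (HAVE2_WGT, W, VAR2); the column names n_total and n_unusable are made up for illustration. It counts, per stratum, how many observations have a missing or nonpositive size measure:

```sas
/* Assumption: HAVE2_WGT, W and VAR2 are named as in the log excerpt above */
proc sql;
select var2,
       count(*) as n_total,                        /* all observations in the stratum */
       sum(missing(w) or w <= 0) as n_unusable     /* missing or nonpositive size measure */
from have2_wgt
group by var2;
quit;
```

If n_unusable is large relative to n_total in a stratum, that stratum may not be able to supply its allocated share of the sample.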

 

With rep=500 and &r=5000 the number of observations in WORK.SAMP should be 2500000. But your PROC FREQ output dataset already revealed that this is not the case: It appears that your WORK.SAMP has only about 1571500 (=500*3143) observations.
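A quick way to compare the expected and the actual number of observations might look like this. It is a sketch under the assumptions that macro variable R still holds the per-replicate sample size, that REP=500 was used and that the sample is stored in WORK.SAMP; the macro variable names EXPECTED and ACTUAL are hypothetical:

```sas
/* Assumption: &r is an integer (e.g., 5000) and REP=500 was specified */
%let expected = %eval(500 * &r);

/* Read the number of observations of WORK.SAMP from the dictionary table */
proc sql noprint;
select nobs into :actual trimmed
from dictionary.tables
where libname = 'WORK' and memname = 'SAMP';
quit;

%put NOTE: Expected &expected observations in WORK.SAMP, found &actual..;
```

Any difference between the two numbers suggests that some strata did not deliver their allocated sample size.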

leex1514
Calcite | Level 5

I found my error in var2. It occurred when I applied coefficients a and b. I fixed it and reran the whole analysis, but my var1's mean is still way higher than the target mean. Is this problem inherent in my data, so that nothing much can be done?

Dataset R4_502 /*HAVE1*/

The MEANS Procedure

Variable                   N           Mean
readingscalescore1    228928    357.0903559
white                 228989      0.3714764

 

 

Sample from R4_503 /*HAVE2*/

The MEANS Procedure

Variable                 N           Mean
readingscalescore1    5000    387.1486000
white                 5000      0.3714000

 

 

 

 

FreelanceReinh
Jade | Level 19

@leex1514 wrote:

... still my var1's mean is way higher than the target mean. Is this problem inherent in my data, so that nothing much can be done?

Dataset R4_502 /*HAVE1*/

The MEANS Procedure

Variable                   N           Mean
readingscalescore1    228928    357.0903559
white                 228989      0.3714764

 

 

Sample from R4_503 /*HAVE2*/

The MEANS Procedure

Variable                 N           Mean
readingscalescore1    5000    387.1486000
white                 5000      0.3714000

 

 

 

 


I'd be surprised if the variability of your data was so large that no better match would be possible (with 500 replications). So, is the 387.1486 already the result of picking the mean closest to the target (357.09...)?

 

After checking the log of the PROC SURVEYSELECT step (especially the number of sampling units omitted, if any), I would look at the distribution of the 500 sample means:

proc means data=sampmeans;
var ms;
run;

Result for the test data:

                      Analysis Variable : ms

  N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------
500     133.9424040       1.7355790     124.3040000     136.2590000
-------------------------------------------------------------------

As you see, the mean is close to the target value (here: 133.9) and this should be the case with your data as well, unless the number of omitted sampling units is substantial.
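For reference, here is one way the per-replicate means could be computed and the replicate closest to the target picked. This is a sketch, not necessarily the code used earlier in the thread. It assumes that SAMP contains the variable Replicate (created by the REP= option), that the analysis variable is called var1 and that the target is the mean of var1 in HAVE1. The names TARGET, DIST and ABSDIFF are hypothetical.

```sas
/* One row per replicate with the sample mean of var1 in variable ms */
proc means data=samp noprint nway;
class replicate;
var var1;
output out=sampmeans(drop=_type_ _freq_) mean=ms;
run;

/* Assumption: the target mean comes from HAVE1 */
proc sql noprint;
select mean(var1) format=best32. into :target trimmed
from have1;
quit;

/* Absolute distance of each replicate mean from the target */
data dist;
set sampmeans;
absdiff = abs(ms - &target);
run;

proc sort data=dist;
by absdiff;
run;

/* The first observation is the replicate whose mean is closest to the target */
proc print data=dist(obs=1);
run;
```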

Sachin6816
Calcite | Level 5
Hi! Thanks for this. It is very useful.
I actually have a similar problem. In my case, both variables are categorical.
Could you please tweak your code for the case where I have to take a sample from one dataset that matches the proportions of both variables in a different dataset?
In other words, if both var1 and var2 are categorical, what would be the correct code to select the sample?
FreelanceReinh
Jade | Level 19

Hi @Sachin6816,

 

I think the ALLOC= option of the STRATA statement applied to two stratum variables (var1, var2) is the solution in your case.

 

Here's a complete example including the creation of test datasets HAVE1, whose var1 and var2 proportions you want to match, and HAVE2, from which you want to draw the random sample (without replacement), stratified by var1 and var2:

/* Create example data for demonstration */

data have1 have2;
set sashelp.heart;
var1=byte(64+whichc(first(BP_Status),'H','N','O')); /* --> values 'A', 'B', 'C' */
var2=(status='Dead'); /* --> values 0, 1 */
if _n_<=2000 then output have1; /* 2000 obs. */
else output have2; /* 3209 obs. */
run;

proc sort data=have2;
by var1 var2;
run;

/* Determine target proportions of (VAR1, VAR2) combinations (A,0), (A,1), (B,0), (B,1), (C,0), (C,1) */

proc freq data=have1;
tables var1*var2 / out=targetprops(drop=count rename=(percent=_alloc_));
run;

/* Draw random sample of size n=1000 from HAVE2 using stratum allocation proportions from HAVE1 */

proc surveyselect data=have2
method=srs n=1000
seed=2718 out=want;
strata var1 var2 / alloc=targetprops;
run;

/* Compare frequency distribution of (VAR1, VAR2) combinations between dataset HAVE1 and the sample */

title 'Dataset HAVE1';
proc freq data=have1;
tables var1*var2;
run;

title 'Sample from HAVE2';
proc freq data=want;
tables var1*var2;
run;
title;
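Before drawing the sample it may be worth checking that each stratum of HAVE2 contains enough observations for its allocated share. The following sketch reuses TARGETPROPS and HAVE2 from the example above; the names AVAIL, CHECK, N_AVAIL, N_NEEDED and SHORTFALL are made up for illustration, and N_NEEDED is only the unrounded allocation, so the sizes actually used by PROC SURVEYSELECT can differ slightly:

```sas
/* Number of available observations per (VAR1, VAR2) stratum in HAVE2 */
proc freq data=have2 noprint;
tables var1*var2 / out=avail(keep=var1 var2 count rename=(count=n_avail));
run;

/* Compare the approximate required sample size (n=1000) with what is available */
data check;
merge targetprops avail;
by var1 var2;
n_needed = 1000 * _alloc_ / 100;    /* _alloc_ is a percentage here */
shortfall = (n_needed > n_avail);   /* 1 if the stratum is too small or absent in HAVE2 */
run;

proc print data=check;
run;
```

Strata flagged with shortfall=1 cannot supply their share without replacement, so either the total sample size or the target proportions would need to be reconsidered.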

 

Hint: To get faster and potentially better replies it's recommended to start a new thread for a new question (even if similar to an old question, which you could always refer to with a link). Thus you reach the largest possible audience and not only those who look into old threads. Also, later readers would not be confused about what question is being answered.

