AndreaBov
Calcite | Level 5

Hi,

 

I have a simple problem that I am trying to solve with PROC SURVEYSELECT, but I am not able to obtain the desired results.

 

I have a dataset with 3 variables: Country (takes only two values: EU, NON EU), Segment (takes 3 values: Large, Medium, Small) and Dollar (amount of exposure).

 

The dataset has 10000 lines.

 

I would like to extract a random sample of 40 lines that has the same (or close to the same) distribution as the original dataset in terms of Dollar.

 

I am using this code:

 

proc sort data=dataset;
  by Country Segment;
run;

proc surveyselect data=dataset out=samp1 method=pps sampsize=40 seed=9876;
  strata Country Segment;
  size Dollar;
run;

 

I get a sample of 40 records, but the proportions of Country and Segment weighted by Dollar are not at all the same as in the original dataset.

 

Where am I going wrong?

 

ballardw
Super User

That is what PPS does with a Size variable: the larger a record's Dollar value, the more likely that record is to be selected.
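
To put a rough number on that: for PPS selection without replacement, the chance that record i in stratum h is picked is proportional to its size measure. The symbols below are my own notation, not anything from the post: n_h is the number of records drawn from stratum h, and x_i is the Dollar value of record i.

```latex
\pi_i \;=\; \frac{n_h \, x_i}{\sum_{j \in h} x_j},
\qquad \text{valid when } n_h \, x_i \le \sum_{j \in h} x_j .
```

So a record with ten times the exposure of another record in the same stratum is ten times as likely to be drawn, which is why the unweighted sample leans toward the large exposures.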

If you want the proportion of dollar values to approximate the data as a whole, then look at the SRS or SYS methods instead. If your dollar amounts span a wide range, I would suggest the SYS method.
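
A minimal sketch of what the SYS version might look like, reusing the dataset and variable names from the question. The output name samp_sys, the ALLOC=PROP option and the CONTROL statement are my additions, not something the original code had, so treat them as assumptions to check against the documentation for your SAS/STAT release.

```sas
proc sort data=dataset;
  by Country Segment;
run;

/* Equal-probability systematic selection: no SIZE statement here. */
proc surveyselect data=dataset out=samp_sys
     method=sys sampsize=40 seed=9876;
  /* Assumption: ALLOC=PROP splits the total of 40 across the
     Country*Segment strata in proportion to stratum row counts. */
  strata Country Segment / alloc=prop;
  /* Assumption: ordering by Dollar within each stratum lets the
     systematic pick spread across small and large amounts. */
  control Dollar;
run;
```

One thing to verify before running it: as far as I recall, a CONTROL statement makes the procedure re-sort the input dataset in place unless you also give an OUTSORT= dataset on the PROC statement.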

 

With 6 groups (2*3) and selecting only 40 records you may have to be flexible about how close you want those proportions to match.
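
One way to see how close you got is to compare dollar-weighted shares in the full data and in the sample. This is only a sketch: it assumes the dataset and samp1 names from the question, and it uses Dollar as a WEIGHT variable in PROC FREQ so that each cell percentage is that cell's share of total Dollar.

```sas
/* Dollar share by Country*Segment in the full data */
proc freq data=dataset;
  tables Country*Segment / norow nocol;
  weight Dollar;
run;

/* Same table for the 40 sampled records */
proc freq data=samp1;
  tables Country*Segment / norow nocol;
  weight Dollar;
run;
```

With only 40 records, differences of several percentage points between the two tables would not be surprising.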