
## How to select specific random entries based on if condition?

I have customer data that includes each customer's occupation. Instead of taking a random sample from the whole base, I want to randomly select customers from one specific occupation and keep the customers from all other occupations unchanged in the output table.

For example:

In the data, I have 1 lakh (100,000) customers working in the private sector, and the count of customers from other occupations is less than 10,000. I want to randomly select only 10,000 customers from the private sector and keep the customers from other occupations as they are in the output data.


## Re: How to select specific random entries based on if condition?

Hello @Saurabh_Rana,

You can use PROC SURVEYSELECT with the SELECTALL option and a STRATA statement (like STRATA occupation;) and then specify, e.g., n=10000 as the sample size. This draws a random sample of 10,000 observations (customers) per occupation where possible and selects all observations from smaller strata (those with at most 10,000 observations). Alternatively, specify an individual sample size for each stratum. With the SELECTALL option it doesn't hurt if some of the sample sizes are larger than the strata they apply to.

Example:

```sas
/* Create example dataset, sorted by stratum (here: age group) */

proc sort data=sashelp.class out=class;
    by age;
run; /* Six age groups (strata): 11, 12, ..., 16. */

/* If possible, select 3 randomly from each group, else select all */

proc surveyselect data=class
        method=srs n=3 selectall
        seed=2718 out=samp;
    strata age;
run;

/* Example with individual sample sizes for each of the six strata */

proc surveyselect data=class
        method=srs n=(2 1 4 4 3 2) selectall
        seed=2718 out=samp2;
    strata age;
run;
```

Compare PROC FREQ results (tables age;) for CLASS, SAMP and SAMP2 to see the effect of the N= and SELECTALL options.
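The comparison could be run roughly as follows. The second half is only a sketch of how the same pattern might be applied to the original question: the dataset name CUSTOMERS, the variable name OCCUPATION, and the output names are assumptions, not names taken from the thread.

```sas
/* Compare stratum counts before and after sampling */
proc freq data=class;
    tables age;
run;

proc freq data=samp;
    tables age;
run;

proc freq data=samp2;
    tables age;
run;

/* Hypothetical application to the original question.
   Assumed names: dataset CUSTOMERS, stratum variable OCCUPATION. */
proc sort data=customers out=customers_sorted;
    by occupation;
run;

/* Draw up to 10,000 customers per occupation; occupations with
   fewer customers than that are kept in full via SELECTALL */
proc surveyselect data=customers_sorted
        method=srs n=10000 selectall
        seed=2718 out=customers_sample;
    strata occupation;
run;

/* Check the resulting counts per occupation */
proc freq data=customers_sample;
    tables occupation;
run;
```

Because only the private sector exceeds 10,000 customers in the scenario described, a single n=10000 should be enough here; a per-stratum list of sample sizes would only be needed if different caps were wanted for different occupations.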



## Re: How to select specific random entries based on if condition?

What if, instead of capping the maximum sample count, I want to cap the maximum proportion (as a percentage)? Basically, can I define the maximum allowed proportion any occupation can have?

## Re: How to select specific random entries based on if condition?

You can specify target percentages for each stratum in the ALLOC= option of the STRATA statement (either as a list of values or in the form of a dataset).

Example (continuing my previous post):

```sas
proc surveyselect data=class
        method=srs n=10
        seed=2718 out=samp;
    strata age / alloc=(10 20 20 20 20 10);
run;
```

This requests proportions of 10%, 20%, ..., 10% for the six age groups 11, 12, ..., 16 with a total sample size of n=10. (Instead of percentages 10, 20, ... you may write proportions like 0.1, 0.2, ... in the list. The sum must be 100 or 1, respectively, up to a little rounding error as in 0.167 for 1/6.)
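For the dataset form of ALLOC=, a minimal sketch might look like the following. It assumes, as I recall from the PROC SURVEYSELECT documentation, that the allocation dataset must contain the STRATA variable(s) plus a variable named _ALLOC_, with the strata in the same order as in the input data; please verify this against the documentation for your SAS release. The dataset names ALLOC_AGE and SAMP_ALLOC are made up for illustration.

```sas
/* Hypothetical allocation dataset: one row per stratum, in the same
   order as the sorted input data, target share stored in _ALLOC_ */
data alloc_age;
    input age _alloc_;
    datalines;
11 0.1
12 0.2
13 0.2
14 0.2
15 0.2
16 0.1
;
run;

/* Same request as the value-list version, read from the dataset */
proc surveyselect data=class
        method=srs n=10
        seed=2718 out=samp_alloc;
    strata age / alloc=alloc_age;
run;
```

Keeping the proportions in a dataset is likely more convenient than a value list when there are many occupations, since the list form depends entirely on the order of the strata.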

Note, however, that the numbers in the list cannot always be attained exactly (e.g., because the sample size of a stratum is necessarily an integer and cannot be greater than the size of the stratum). This includes cases where the actual proportion of a stratum exceeds the corresponding allocated proportion. Change n=10 to n=11 in the code above to see such an example. But if you specify realistic proportions (based on your knowledge of the stratum sizes), the result should be satisfactory.
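The n=11 experiment mentioned above could be run like this; the output name SAMP11 is an arbitrary choice. The Percent column of the PROC FREQ table shows the proportions actually achieved, which can then be compared with the requested 10%, 20%, ..., 10%.

```sas
/* Same allocation as before, but with a total sample size of 11 */
proc surveyselect data=class
        method=srs n=11
        seed=2718 out=samp11;
    strata age / alloc=(10 20 20 20 20 10);
run;

/* Inspect the achieved counts and percentages per age group */
proc freq data=samp11;
    tables age;
run;
```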
