Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Proc Surveyselect: If variable=1 always include this row in sample

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 10-02-2019 06:17 AM
(1162 views)

Hello everyone,

I have a problem that is rather specific and I cannot find any option to solve it:

I have two datasets:

- reference_data
- application_data

with both having been binned by the same method. The respective bin for each row is stored in a column called "Bin". Furthermore I have a variable called "Flag" which can take on either 0 or 1, whereas the majority of data is set to 0 (since 1 is considered to be an outlier and thus rare but important).

Example *reference_data* set:

Var1 | Var2 | ... | Bin | Flag

________________________

5.12 | 015 | ... | 0 | 0

5.78 | 0.28 | ... | 0 | 0

3.45 | 0.91 | ... | 1 | 0

3.94 | 0.85 | ... | 1 | 1

8.23 | 0.17 | ... | 2 | 0

.....

From the *application_data* set I can calculate the percentage of data within the respective bin, which I use to calculate "SAMPSIZE" for "PROC SURVEYSELECT" for the *reference_data*, e.g:

- Bin 1 = 5.12%
- Bin2 = 26.87%
- ...

Using "SAMPSIZE", I now draw random samples from the *reference_data* which means that the *stratified_reference_data* composition will match the composition of *application_data*. This gives me a perfectly stratified reference_data.

However, I need to make sure that all rows in which "Flag = 1" are included in the *stratified reference_data* set (in this regard random sampling has to be overridden), except when it would cause the stratified reference_data to have a different composition than application_data. (This would cause a bad Population Stability Index).

I hope that I explained the problem understandably, otherwise let me know and I will try to clarify.

Thanks in advance and best regards,

Weiler

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hello @MWeiler and welcome to the SAS Support Communities!

I assume that you use a dataset containing stratum (i.e. bin) sample sizes in the SAMPSIZE= option (let's call it size_data). You could create a subset size_data1 of size_data containing only the bins with at least one observation with Flag=1 in the reference_data. Using size_data1 in the SAMPSIZE= option when sampling from reference_data**(where=(flag=1))** would yield the first part of the intended sample. You will need to use the SELECTALL option in that PROC SURVEYSELECT step because most sample sizes will likely exceed the number of flagged observations.

The second part of the sample would be drawn from reference_data**(where=(flag=0))** using a SAMPSIZE dataset, say size_data0, which contains for each bin the original sample size (from size_data) minus the frequency observed in the first part of the sample.

Does this make sense?

9 REPLIES 9

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I believe you need the CERTSIZE option. However cannot find an example and I haven't figured out how to use it. This paper may be some help it mentions CERTSIZE and why one might want to select observations with certainty. But does not provide an example that I can see.

https://www.mwsug.org/proceedings/2013/AA/MWSUG-2013-AA02.pdf

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hello @MWeiler and welcome to the SAS Support Communities!

I assume that you use a dataset containing stratum (i.e. bin) sample sizes in the SAMPSIZE= option (let's call it size_data). You could create a subset size_data1 of size_data containing only the bins with at least one observation with Flag=1 in the reference_data. Using size_data1 in the SAMPSIZE= option when sampling from reference_data**(where=(flag=1))** would yield the first part of the intended sample. You will need to use the SELECTALL option in that PROC SURVEYSELECT step because most sample sizes will likely exceed the number of flagged observations.

The second part of the sample would be drawn from reference_data**(where=(flag=0))** using a SAMPSIZE dataset, say size_data0, which contains for each bin the original sample size (from size_data) minus the frequency observed in the first part of the sample.

Does this make sense?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hello FreelanceRheinhard,

yes this makes sense and is what I actually got working just now.

Thanks for your feedback 🙂

However, I was wondering whether there is an in-built way of doing this?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Sir @FreelanceReinh When and if you have a moment, Can you help me/other folks understand whether that works on equal probabilities or subject proportions based on sample size/rates. Thank you in advance

PS You are likely to ask me did i read the doc, yes but my dumb brain didn't understand it. So asking

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi @novinosrin: My understanding was that the OP had already determined sample sizes based on the strata (=bin) proportions in the application_data. The remaining task was just to *prioritize* the flagged observations when sampling from the reference_data, but without exceeding the planned sample sizes per stratum.

For example, if the planned sample size for a particular stratum in the reference_data was 5 out of 20 (i.e., each item had a selection probability of 0.25 in the case of simple random sampling) and there were 2 items with Flag=1 in that stratum, the sample should be composed of: the 2 flagged items (selection probability 1) plus 3 out of the 18 items with Flag=0 (thus with a reduced selection probability of 1/6=0.1666...). If, however, 8 items in that stratum had Flag=1, only a random sample of 5 out of these 8 would be drawn (selection probability 0.625 -- always assuming simple random sampling) and none of the remaining 12 items with Flag=0 would be selected.

Does this answer your question?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you Sir @FreelanceReinh that's exactly what I was after. Very neat! Appreciate your time!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

One way is to add that variable to the strata and make the selection for those strata 100%.

So your samprate information would look like

BIN flag rate

1 0 .0512

1 1 1

2 0 .2687

2 1 1

I have ho idea how complex that may be since you did not provide any actual example code.

**Don't miss out on SAS Innovate - Register now for the FREE Livestream!**

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.