Solved: Re: Proc Surveyselect: If variable=1 always include this row in sample

MWeiler · Posted 10-02-2019 06:17 AM

Hello everyone,

I have a problem that is rather specific and I cannot find any option to solve it:

I have two datasets:

reference_data
application_data

with both having been binned by the same method. The respective bin for each row is stored in a column called "Bin". Furthermore I have a variable called "Flag" which can take on either 0 or 1, whereas the majority of data is set to 0 (since 1 is considered to be an outlier and thus rare but important).

Example reference_data set:

Var1 | Var2 | ... | Bin | Flag

________________________

5.12 | 015 | ... | 0 | 0

5.78 | 0.28 | ... | 0 | 0

3.45 | 0.91 | ... | 1 | 0

3.94 | 0.85 | ... | 1 | 1

8.23 | 0.17 | ... | 2 | 0

.....

From the application_data set I can calculate the percentage of data within the respective bin, which I use to calculate "SAMPSIZE" for "PROC SURVEYSELECT" for the reference_data, e.g:

Bin 1 = 5.12%
Bin2 = 26.87%
...

Using "SAMPSIZE", I now draw random samples from the reference_data which means that the stratified_reference_data composition will match the composition of application_data. This gives me a perfectly stratified reference_data.

However, I need to make sure that all rows in which "Flag = 1" are included in the stratified reference_data set (in this regard random sampling has to be overridden), except when it would cause the stratified reference_data to have a different composition than application_data. (This would cause a bad Population Stability Index).

I hope that I explained the problem understandably, otherwise let me know and I will try to clarify.

Thanks in advance and best regards,

Weiler

FreelanceReinh · Posted 10-02-2019 07:56 AM

Hello @MWeiler and welcome to the SAS Support Communities!

I assume that you use a dataset containing stratum (i.e. bin) sample sizes in the SAMPSIZE= option (let's call it size_data). You could create a subset size_data1 of size_data containing only the bins with at least one observation with Flag=1 in the reference_data. Using size_data1 in the SAMPSIZE= option when sampling from reference_data(where=(flag=1)) would yield the first part of the intended sample. You will need to use the SELECTALL option in that PROC SURVEYSELECT step because most sample sizes will likely exceed the number of flagged observations.

The second part of the sample would be drawn from reference_data(where=(flag=0)) using a SAMPSIZE dataset, say size_data0, which contains for each bin the original sample size (from size_data) minus the frequency observed in the first part of the sample.

Does this make sense?

View solution in original post

andreas_lds · Posted 10-02-2019 06:47 AM

Because i have hardly any experience in statistics i need to see an example of what you have, what you get now and what you need. First thought was, to add the required obs after proc surveyselect to the result-dataset. Maybe (most likely) a bad idea.

data_null__ · Posted 10-02-2019 07:48 AM

I believe you need the CERTSIZE option. However cannot find an example and I haven't figured out how to use it. This paper may be some help it mentions CERTSIZE and why one might want to select observations with certainty. But does not provide an example that I can see.

https://www.mwsug.org/proceedings/2013/AA/MWSUG-2013-AA02.pdf

FreelanceReinh · Posted 10-02-2019 07:56 AM

Hello @MWeiler and welcome to the SAS Support Communities!

I assume that you use a dataset containing stratum (i.e. bin) sample sizes in the SAMPSIZE= option (let's call it size_data). You could create a subset size_data1 of size_data containing only the bins with at least one observation with Flag=1 in the reference_data. Using size_data1 in the SAMPSIZE= option when sampling from reference_data(where=(flag=1)) would yield the first part of the intended sample. You will need to use the SELECTALL option in that PROC SURVEYSELECT step because most sample sizes will likely exceed the number of flagged observations.

The second part of the sample would be drawn from reference_data(where=(flag=0)) using a SAMPSIZE dataset, say size_data0, which contains for each bin the original sample size (from size_data) minus the frequency observed in the first part of the sample.

Does this make sense?

MWeiler · Posted 10-02-2019 09:23 AM

Hello FreelanceRheinhard,

yes this makes sense and is what I actually got working just now.

Thanks for your feedback 🙂

However, I was wondering whether there is an in-built way of doing this?

FreelanceReinh · Posted 10-02-2019 10:44 AM

Glad to see that my suggestion helped you. Indeed, I wouldn't be surprised if more frequent users of PROC SURVEYSELECT came up with some built-in functionality that could serve your purpose (which seems quite common to me), be it CERTSIZE as suggested by data_null__ or something else. There are so many options that I've never used but I'd like to learn more about.

novinosrin · Posted 10-02-2019 09:30 AM

Sir @FreelanceReinh When and if you have a moment, Can you help me/other folks understand whether that works on equal probabilities or subject proportions based on sample size/rates. Thank you in advance

PS You are likely to ask me did i read the doc, yes but my dumb brain didn't understand it. So asking

FreelanceReinh · Posted 10-02-2019 11:16 AM

Hi @novinosrin: My understanding was that the OP had already determined sample sizes based on the strata (=bin) proportions in the application_data. The remaining task was just to prioritize the flagged observations when sampling from the reference_data, but without exceeding the planned sample sizes per stratum.

For example, if the planned sample size for a particular stratum in the reference_data was 5 out of 20 (i.e., each item had a selection probability of 0.25 in the case of simple random sampling) and there were 2 items with Flag=1 in that stratum, the sample should be composed of: the 2 flagged items (selection probability 1) plus 3 out of the 18 items with Flag=0 (thus with a reduced selection probability of 1/6=0.1666...). If, however, 8 items in that stratum had Flag=1, only a random sample of 5 out of these 8 would be drawn (selection probability 0.625 -- always assuming simple random sampling) and none of the remaining 12 items with Flag=0 would be selected.

Does this answer your question?

novinosrin · Posted 10-02-2019 11:19 AM

Thank you Sir @FreelanceReinh that's exactly what I was after. Very neat! Appreciate your time!

ballardw · Posted 10-02-2019 10:41 AM

One way is to add that variable to the strata and make the selection for those strata 100%.

So your samprate information would look like

BIN flag rate

1 0 .0512

1 1 1

2 0 .2687

2 1 1

I have ho idea how complex that may be since you did not provide any actual example code.

Ready to join fellow brilliant minds for the SAS Hackathon?