BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MWeiler
Calcite | Level 5

Hello everyone,

 

I have a problem that is rather specific and I cannot find any option to solve it:

 

I have two datasets:

  • reference_data
  • application_data

with both having been binned by the same method. The respective bin for each row is stored in a column called "Bin". Furthermore I have a variable called "Flag" which can take on either 0 or 1, whereas the majority of data is set to 0 (since 1 is considered to be an outlier and thus rare but important).

 

Example reference_data set:

 

Var1 | Var2 | ... | Bin    | Flag

________________________

5.12 | 015  | ...  | 0       |  0

5.78 | 0.28 | ...  | 0       |  0

3.45 | 0.91 | ...  | 1       |  0

3.94 | 0.85 | ...  | 1       |  1

8.23 | 0.17 | ...  | 2       |  0

.....

 

From the application_data set I can calculate the percentage of data within the respective bin, which I use to calculate "SAMPSIZE" for "PROC SURVEYSELECT" for the reference_data, e.g:

 

  • Bin 1 = 5.12%
  • Bin2 = 26.87%
  • ...

 

Using "SAMPSIZE", I now draw random samples from the reference_data which means that the stratified_reference_data composition will match the composition of application_data. This gives me a perfectly stratified reference_data.

 

However, I need to make sure that all rows in which "Flag = 1" are included in the stratified reference_data set (in this regard random sampling has to be overridden), except when it would cause the stratified reference_data to have a different composition than application_data. (This would cause a bad Population Stability Index).

 

I hope that I explained the problem understandably, otherwise let me know and I will try to clarify.

 

Thanks in advance and best regards,

 

Weiler

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @MWeiler and welcome to the SAS Support Communities!

 

I assume that you use a dataset containing stratum (i.e. bin) sample sizes in the SAMPSIZE= option (let's call it size_data). You could create a subset size_data1 of size_data containing only the bins with at least one observation with Flag=1 in the reference_data. Using size_data1 in the SAMPSIZE= option when sampling from reference_data(where=(flag=1)) would yield the first part of the intended sample. You will need to use the SELECTALL option in that PROC SURVEYSELECT step because most sample sizes will likely exceed the number of flagged observations.

 

The second part of the sample would be drawn from reference_data(where=(flag=0)) using a SAMPSIZE dataset, say size_data0, which contains for each bin the original sample size (from size_data) minus the frequency observed in the first part of the sample.

 

Does this make sense?

 

View solution in original post

9 REPLIES 9
andreas_lds
Jade | Level 19

Because i have hardly any experience in statistics i need to see an example of what you have, what you get now and what you need. First thought was, to add the required obs after proc surveyselect to the result-dataset. Maybe (most likely) a bad idea.

data_null__
Jade | Level 19

I believe you need the CERTSIZE option.  However cannot find an example and I haven't figured out how to use it.  This paper may be some help it mentions CERTSIZE and why one might want to select observations with certainty.  But does not provide an example that I can see.

 

https://www.mwsug.org/proceedings/2013/AA/MWSUG-2013-AA02.pdf

FreelanceReinh
Jade | Level 19

Hello @MWeiler and welcome to the SAS Support Communities!

 

I assume that you use a dataset containing stratum (i.e. bin) sample sizes in the SAMPSIZE= option (let's call it size_data). You could create a subset size_data1 of size_data containing only the bins with at least one observation with Flag=1 in the reference_data. Using size_data1 in the SAMPSIZE= option when sampling from reference_data(where=(flag=1)) would yield the first part of the intended sample. You will need to use the SELECTALL option in that PROC SURVEYSELECT step because most sample sizes will likely exceed the number of flagged observations.

 

The second part of the sample would be drawn from reference_data(where=(flag=0)) using a SAMPSIZE dataset, say size_data0, which contains for each bin the original sample size (from size_data) minus the frequency observed in the first part of the sample.

 

Does this make sense?

 

MWeiler
Calcite | Level 5

Hello FreelanceRheinhard,

 

yes this makes sense and is what I actually got working just now.

 

Thanks for your feedback 🙂

 

However, I was wondering whether there is an in-built way of doing this?

FreelanceReinh
Jade | Level 19

Glad to see that my suggestion helped you. Indeed, I wouldn't be surprised if more frequent users of PROC SURVEYSELECT came up with some built-in functionality that could serve your purpose (which seems quite common to me), be it CERTSIZE as suggested by data_null__ or something else. There are so many options that I've never used but I'd like to learn more about.

novinosrin
Tourmaline | Level 20

Sir @FreelanceReinh   When and if you have a moment, Can you help me/other folks understand whether that works on equal probabilities or subject proportions based on sample size/rates. Thank you in advance

 

PS You are likely to ask me did i read the doc, yes but my dumb brain didn't understand it. So asking 

FreelanceReinh
Jade | Level 19

Hi @novinosrin: My understanding was that the OP had already determined sample sizes based on the strata (=bin) proportions in the application_data. The remaining task was just to prioritize the flagged observations when sampling from the reference_data, but without exceeding the planned sample sizes per stratum.

 

For example, if the planned sample size for a particular stratum in the reference_data was 5 out of 20 (i.e., each item had a selection probability of 0.25 in the case of simple random sampling) and there were 2 items with Flag=1 in that stratum, the sample should be composed of: the 2 flagged items (selection probability 1) plus 3 out of the 18 items with Flag=0 (thus with a reduced selection probability of 1/6=0.1666...). If, however, 8 items in that stratum had Flag=1, only a random sample of 5 out of these 8 would be drawn (selection probability 0.625 -- always assuming simple random sampling) and none of the remaining 12 items with Flag=0 would be selected.

 

Does this answer your question?

novinosrin
Tourmaline | Level 20

Thank you Sir @FreelanceReinh   that's exactly what I was after. Very neat! Appreciate your time!

ballardw
Super User

One way is to add that variable to the strata and make the selection for those strata 100%.

So your samprate information would look like

 

BIN    flag   rate

1        0        .0512

1       1         1

2        0         .2687

2        1         1

 

 

I have ho idea how complex that may be since you did not provide any actual example code.

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1427 views
  • 2 likes
  • 6 in conversation