Anotherdream
Quartz | Level 8

Hello. I was presented with a 'sampling solution' by someone else and was asked if there was anything wrong with it. The issue is that I have no idea. The solution "feels" wrong, but I cannot find any problem with it logically. Can anyone take a peek and let me know if there are any issues with the designed solution?

Background

Basically a company wants to perform 15+ statistical tests on a population, and for each test they want to use a specific distribution (let's assume normal) with a 5% assumed error rate, a 2% margin of error, and 95% confidence (population size of 10,000, with the finite population correction factor applied). Each test will then require 437 loans.
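For reference, here is a quick sketch of where that 437 comes from, using the standard sample-size formula for a proportion with the finite population correction (the inputs are the ones stated above; the variable names are mine):

data sample_size;
   p = 0.05;              /* assumed error rate */
   e = 0.02;              /* margin of error */
   N = 10000;             /* population size */
   z = probit(0.975);     /* two-sided 95% confidence */
   n0 = (z**2) * p * (1 - p) / (e**2);   /* infinite-population size, about 456 */
   n  = ceil(n0 / (1 + (n0 - 1) / N));   /* finite population correction: 437 */
   put n0= n=;
run;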

HOWEVER, one loan can be used in multiple tests. An example of a test would be Test 1 = "% of loans where the person's name was not misspelled" and Test 2 = "% of loans with a balance under 50,000". One loan CAN have both a balance and a name, so it can fit into both buckets. However, not all loans will fit into all buckets (some loans might not have the person's name, for example).

The company that came to the auditors sees that the number of loans required is 15 * 437, or 6,555 (15 independent random samples of the 437 noted above). They are only willing to do the work if the sample size is roughly 1,000 loans at maximum. To allow for this, the auditing company comes up with the following solution.

Solution that I question:
First, take a random sample of 437 loans. Since one loan can fall into multiple buckets, look at each of the 437 loans and assign it to every bucket whose attributes it has (e.g., person's name and loan balance). In our example above, the one loan would go into both the Name test and the Balance test, so this loan counts as 1 of the 437 needed in EACH bucket. Then, per bucket, once you reach 437 loans you are done.

However, many of the 15 buckets will not have 437 loans, because it is highly likely that fewer than 437 of the 437 selected loans will apply to a particular test bucket. For example, maybe only 200 loans have a recorded person's name, so the other 237 cannot be used in the Test 1 bucket.

At this point, find out how many more loans you need to reach 437 per test, and simply sample that many more loans per test. Meaning, if Test 1 was undersampled by 237 loans, sample 237 more loans from the population that meet the requirements for Test 1. Then repeat the process for Test 2, and so on.

By doing this, your original sample of 437 loans can be used across multiple tests, and you only fill in the missing loans per test. In addition, the auditing company says each test is still a statistically random sample and would hold up to third-party scrutiny.
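To make the proposed procedure concrete, here is a minimal SAS sketch of the first top-up step. It assumes a LOANS data set with a LOAN_ID key and an ELIG_TEST1 flag equal to 1 when a loan has the attribute Test 1 needs; those names are illustrative, not from the original description.

/* Step 1: one simple random sample of 437 loans shared by all tests */
proc surveyselect data=loans out=base_sample
     method=srs sampsize=437 seed=20240101;
run;

/* Step 2: count how many of the 437 are usable for Test 1, and the shortfall */
proc sql noprint;
   select 437 - sum(elig_test1) into :shortfall1
   from base_sample;
quit;

/* Step 3: top up Test 1 from eligible loans not already drawn
   (assumes the shortfall is positive) */
proc sql;
   create table remaining1 as
   select * from loans
   where elig_test1 = 1
     and loan_id not in (select loan_id from base_sample);
quit;

proc surveyselect data=remaining1 out=topup1
     method=srs sampsize=&shortfall1 seed=20240102;
run;

/* Repeat steps 2-3 for Test 2 through Test 15 */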

Question 1)
What mistakes (if any) did the auditing company make?

Question 2)
Is the sample a simple random sample? Is it a random sample at all?

Question 3)
Is there anything wrong with this methodology, assuming the company just needs a random sample and not a simple random sample?

Question 4)

Do the associated samples still achieve the required 95% confidence and 2% margin of error for each of the 15 tests, even though loans were shared between tests?

Question 5)

If the solution given is incorrect, is there any solution that will allow for 15 tests at the required specifications with a total of less than 1,000 loans?

Please let me know if my question did not make sense!

11 REPLIES
ballardw
Super User

I can answer question 2 as: generally NOT a simple random sample. The additional records will have, or at least should have, a different weight, since they are sampled from a different population (the whole population minus the records already sampled).

Question 5: Possibly. It would depend on how commonly loans fall into the required buckets.

I have a thought that I would ask: how much work (read: COST) is involved with getting the sample? If all of the records involved are in a single data source, then the cost should be negligible and I would just use a larger sample. 437 should be considered a minimum size, assuming "they" did the sample size determination correctly.

Anotherdream
Quartz | Level 8

Hey Ballard. There is a very large cost to performing work on the loans sampled; the auditor has to audit each loan, and that costs hundreds if not thousands of dollars per loan. So yes, the minimum sample size is of vast importance.

I agree with your answer to question 2. I don't know if that matters in this case, however, which is kind of why I'm stuck.

Can you explain your answer to question 5 a bit more?

I know that of the 15 buckets, a loan WILL usually contain information for about 14-15 of them, so the density of sharing is VERY large. I "know" I can get to fewer than 1,000 sampled loans (probably closer to 500 or so) if loans are allowed to fall into more than one bucket. My question is: is it acceptable statistically to sample 500 loans and then push each of the 500 sampled loans into 15 different tests, if the original request was to perform a statistically significant random sample for 15 different criteria (buckets), where a loan can fall into multiple buckets and each test uses the two-tailed normal with the assumptions above (which gives the required sample size of 437)?

So basically my question is: if I have 15 tests to perform, all of which require a sample size of 437, but one loan can satisfy all 15 tests, can I simply sample 437 loans randomly from the population and then perform each of the 15 tests on those loans? I can do this from a process standpoint, it works conceptually, and it will result in a much smaller sample size, but is there any problem statistically with doing this?

Hope that makes sense

ballardw
Super User

I hesitate to flatly say "a random sample is a random sample," but think of all the survey research that is done. Many surveys cover multiple topics from one sample. There is actually an advantage to having a shared sample for some modeling purposes.

My thought is that if you "know" that 80% of your records should have all 15 needed elements, then pick a sample size such that 0.80 x sample size is at least 437, though it might be a good idea, again depending on the cost of collection compared to any loss of precision in the analysis, to fudge upwards just a bit.
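As a quick illustration of that rule of thumb (the 80% coverage rate is the hypothetical figure above, not something measured from the loan data):

data inflate;
   n_required = 437;                          /* per-test sample size */
   coverage   = 0.80;                         /* assumed share of loans usable for every test */
   n_draw     = ceil(n_required / coverage);  /* loans to draw up front: 547 here */
   put n_draw=;
run;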

Anotherdream
Quartz | Level 8

Got ya. I actually see your point about survey research. I guess my "gut reaction" was as follows.

If 1,000 people are all asked 6 questions, you technically have 6 statistical tests, each of which has a sample size of 1,000. For example, if one question was "do you like television?", you should be able to take the 1,000 people sampled and then say "X% of the population likes television, with such and such a confidence interval," and you could then do the same for the other questions being asked. However, if you got a 'bad sample' (sometimes the true population mean doesn't fall within your confidence interval, by the fundamental nature of statistics), and other questions are correlated with this question, then a problem might arise.

Example: what if another question asked is "do you like radio?" Maybe there is a high correlation between not liking radio and not liking television, and maybe the 1,000 people you sampled were a group that was not 'normal' relative to the population and didn't like television. Well, then your question on radio is now also biased....

This is what went through my brain... Am I wrong in an assumption I've made above?  (by the way thank you very much for your help)...

Or perhaps my assumption is correct and this is a fundamental flaw in not performing 6 independent random samples, one per survey question.... You get a much smaller sample size, but accept some unreported correlation between the questions being asked?

ballardw
Super User

If your sample is (reasonably) random, then your example of radio being biased is not true UNLESS you only test those that do not like TV. Of course, if you phrase the analysis as "among those respondents that dislike TV, xx percent also dislike radio," then you are describing something other than "among all respondents, yy percent dislike radio."

Actually, the advantage is that you can test that supposition with more strength than with two independent samples, as well as sometimes find other associated behaviors.

Anotherdream
Quartz | Level 8

Hey BallardW. Your responses are helping very much so thanks in advance. 

I do want to test the statement "among all respondents, yy percent dislike radio" and, independently, the statement "among all respondents, xx percent dislike television" — specifically not among those who dislike radio who also dislike television (or vice versa).

And I understand that if the sample is truly random then the sample mean approaches the true population mean as n gets larger.


The part I'm struggling with is as follows. By definition a random sample can be 'abnormal', meaning it was an outlier. Using a 95% confidence level to build a confidence interval from a random sample gives you a bounded estimate of the true population mean, but it implies that only 95% of all such sample-based intervals actually contain the true population parameter, so it's quite possible we got one of the 5% of samples that miss.

Let's say, for example, our sample says that 33% of people like television, with a 95% confidence interval of 31-35%, but the TRUE value in the population is 36.8%. Then our sample that gave us 33% was one of the roughly 2.5% of samples that miss on that side, and the sample was just 'unlucky' (while still being random).

Now assume that everyone who likes television also likes radio. Our sample estimate was 3.8 percentage points below the true value for television likers; wouldn't we also underestimate the population proportion of people who like radio by the same amount (since the one test is perfectly correlated with the other)?
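Here is a small simulation sketch of exactly that scenario, using the hypothetical numbers from this exchange (true rate 36.8% for both questions, perfect correlation between them, n = 437, normal-approximation intervals). It checks how often the shared-sample confidence intervals miss the truth, and shows that when they miss, they miss together rather than independently.

data coverage;
   call streaminit(2718);
   p_true = 0.368;          /* hypothetical true proportion for both questions */
   n      = 437;            /* shared sample size */
   z      = probit(0.975);  /* two-sided 95% confidence */
   do rep = 1 to 10000;
      x    = rand('binomial', p_true, n);  /* one draw serves both questions */
      phat = x / n;
      half = z * sqrt(phat * (1 - phat) / n);
      miss_tv    = (p_true < phat - half) or (p_true > phat + half);
      miss_radio = miss_tv;                /* perfect correlation: identical answers */
      both_miss  = (miss_tv and miss_radio);
      output;
   end;
   keep rep miss_tv miss_radio both_miss;
run;

proc means data=coverage mean;
   var miss_tv both_miss;   /* each near 0.05: the misses coincide rather than compound */
run;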

Does the statement I made above make sense? That is the confusion I am struggling with. Thanks a bunch

ballardw
Super User

You have the likelihood of that type of error with any sample. You made the decision at the start that a 95% confidence interval (being willing to take a chance 5% of the time) was good enough for your purposes, and you are willing to live with the consequences. If that isn't the case, you need a different confidence interval and must accept the associated costs.

You might play around with PROC POWER to see the effects of different sample scenarios.
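For example, a run along these lines solves for the sample size of a one-sample proportion test; the 10% null rate, 5% alternative rate, and 80% power below are placeholders to illustrate the syntax, not figures from this thread.

proc power;
   onesamplefreq test=z method=normal
      sides          = 2
      alpha          = 0.05
      nullproportion = 0.10   /* hypothesized error rate under the null */
      proportion     = 0.05   /* assumed true error rate */
      power          = 0.80
      ntotal         = .;     /* solve for the required sample size */
run;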

Anotherdream
Quartz | Level 8

Ah, so your counterpoint is that we accepted a 5% chance of an error occurring on every test (which is very true). So, by definition, an error occurring on one test could affect the results of another test, but we have to be okay with that.... I guess that's just hard for me to understand, for the following reason.

If you randomly sampled 437 loans for Test 1 and 437 loans for Test 2 out of 100,000,000 loans, and Test 1 was a bad sample (one of the samples where the confidence interval doesn't contain the true mean), which will happen 5% of the time, it doesn't imply that Test 2 is also in error, even if the two are perfectly correlated, because different loans will likely be sampled for Test 1 and Test 2.


However, in our example they are both in error by definition (the same loans appear in both tests, with perfect correlation). To me this implies the methodology is wrong, because the 5% chance of error is actually greater than 5% when you share data between tests....

Do you see where I am coming from?

Reeza
Super User

If you're doing multiple tests on the same sample, you need to use a correction method (e.g., Bonferroni) and therefore actually increase your sample size to keep your 5% acceptable error rate.
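A rough sketch of what that correction does to the per-test sample size, reusing the inputs from the original post; splitting the 5% alpha evenly across the 15 tests is the simple Bonferroni rule, and the exact design choice is an assumption here.

data bonferroni;
   p     = 0.05;               /* assumed error rate */
   e     = 0.02;               /* margin of error */
   N     = 10000;              /* population size */
   k     = 15;                 /* number of tests sharing the sample */
   alpha = 0.05 / k;           /* Bonferroni-adjusted per-test alpha */
   z     = probit(1 - alpha/2);
   n0    = (z**2) * p * (1 - p) / (e**2);
   n     = ceil(n0 / (1 + (n0 - 1) / N));  /* roughly 930 per test with these inputs */
   put alpha= z= n=;
run;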

Anotherdream
Quartz | Level 8

Hey Reeza. So you're saying that if you take a group of 437 randomly selected people and ask them two questions, you cannot make two independent hypothesis tests (one for each question), each at a predetermined confidence level?

For example, suppose we came in with the following two independent null hypotheses:

First Null Hypothesis

"less than 10% of the population likes tv"

Second Null Hypothesis

"less than 21% of the population likes red hair"

You're saying we shouldn't design a survey that asks people the two questions 1) "do you like TV?" and 2) "do you like red hair?", because we couldn't use the results of these questions to test both null hypotheses separately, each at a 95% confidence level?

That's what the company is trying to do, but you're saying it's wrong, correct? I'll try to research the method you suggested, but can you provide any more insight to make sure I understood you correctly?

Thanks very much!

Anotherdream
Quartz | Level 8

Hey again @Reeza and @ballardw, thanks very much for all of your help! I feel like I will have taken an undergraduate statistics class by the time this is done, and your patience is extremely appreciated.

I did some research into the Bonferroni method and the problem of multiple comparisons in statistical testing, and I have a quick question. Doesn't the problem of multiple comparisons theoretically apply to any group of statistical tests, regardless of whether they came from the same population or not?

Example: if you wanted to test 100 null hypotheses, all of which were unrelated, you could either sample one population and ask 100 different questions, or sample 100 independent populations and ask each one question.

In both cases, each test has a 5% chance of a type I error (assuming 95% confidence), so on average you'd expect 5 of the tests to incorrectly reject the null hypothesis, and the probability of at least one test incorrectly rejecting the null would be over 99.3% (I just used the Poisson distribution here: 1 - P(0 events given a mean of 5)).
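That back-of-the-envelope number checks out; here is a quick comparison of the exact calculation with the Poisson approximation used above (100 tests and a 5% per-test error rate are the figures from this example):

data fwer;
   k     = 100;                           /* number of independent tests */
   alpha = 0.05;                          /* per-test type I error rate */
   exact   = 1 - (1 - alpha)**k;          /* 1 - P(no false rejections): ~0.994 */
   poisson = 1 - pdf('poisson', 0, k * alpha);  /* Poisson approximation: ~0.993 */
   put exact= poisson=;
run;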

So from what I read, I think everything implies that the multiple-comparison correction applies equally to one shared sample or to many independent samples... Is that a correct statement to make, in your opinion?

Thanks again
