- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi, everybody,
Did anybody know how I could select a random sample with a condition? Say, I have a 1 MM customers loan dataset. I wanted to select two random samples with a condition that the average loan balance in each sample should close to $800. Did anybody know how I could do it?
Thanks in advance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What does your data look like? and What are your other criteria?
Would a sample of 3 loans, say $10, $800, 1590 be a valid sample?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You probably would have to adjust the code according to your requirement but this would help as start up for you.
Code:
DATA TEMP;
INPUT SAMPLE LOAN;
CARDS;
1 800
1 800
2 100
3 800
4 800
3 800
4 800
;
RUN;
PROC SQL;
SELECT SAMPLE, AVG(LOAN) AS AVG_LOAN FROM TEMP
WHERE RANUNI(111) < 0.55
GROUP BY SAMPLE HAVING AVG_LOAN = 800;
QUIT;
Output:
sample avg_loan
1 800
4 800
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Not being a statistician in my simple world I would expect a RANDOM sample to have the same characteristics than the universe you draw it from. If this is true then you would first have to "tailor" yourself a universe with the desired characteristics.
I assume you first would have to sub-set your source table and then draw the sample from this sub-set (http://support.sas.com/kb/24/722.html).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A similar question was asked at https://communities.sas.com/message/122173
Lots of suggestions there. It's not clear how your $800 requirement relates to the properties of the data. Is $800 the average balance among all customers?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Two things to consider that would considerably change the approach taken:
- What should the variance of the sample sets be? The same for both sets or the same as the population?
- What is the purpose of the control over the mean (=800$). To eliminate the effect of a cofactor or to match the overall loan population?