Hello,
I have a table that looks like that:
PackageID | LenderID | DealAmount | … |
100 | 8 | 100000000 | … |
100 | 103 | 100000000 | … |
100 | 500 | 100000000 | … |
100 | 493 | 100000000 | … |
100 | 359 | 100000000 | … |
100 | 245 | 100000000 | … |
100 | 444 | 100000000 | … |
100 | 1234 | 100000000 | … |
101 | 2 | 254000000 | … |
101 | 8 | 254000000 | … |
101 | 103 | 254000000 | … |
101 | 87 | 254000000 | … |
101 | 237 | 254000000 | … |
102 | 503 | 25000000 | … |
102 | 8 | 25000000 | … |
… | … | … | … |
I have a PackageID which is a number associated to a loan.
I have many lenders for a PackageID (LenderID is the identification number associated to a lender). The number of lenders by PackageID may vary between two and 100 approximately.
I want to test some hypothesis. For this purpose, I must select randomly half of the lenders by PackageID, the third of the lenders by PackageID, a quarter of the lenders by PackageID and so on.
What should I do? Most of the time, if I ask to select the third of the lenders (for example), it won't result in an integer for the number of lenders. How does I treat that in a code?
Thank you in advance!
With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do. - PG
Try looking into PROC SURVEYSELECT for starters.
Try this one. You can vary "samprate" to get 25%, 50% and 75% samples stratified by PackageID
proc surveyselect data = have out = want method = srs samprate = .5 seed = 1230;
PackageID;
run;
Building on suggestion, you may circumvent the small strata sizes problem with replicated sampling. The idea is to repeat sample selection so that on average you get the proportion you want. For example :
proc surveyselect data=have out=want method=BERNOULLI samprate=0.333333 seed=8757587 rep=50;
strata PackageID;
run;
proc print data=want noobs; where PackageID=102; run;
Package Lender Deal Selection Sampling
ID Replicate ID Amount Prob Weight
102 1 8 25000000 0.33333 3.00000
102 3 503 25000000 0.33333 3.00000
102 3 8 25000000 0.33333 3.00000
102 5 503 25000000 0.33333 3.00000
102 5 8 25000000 0.33333 3.00000
102 6 8 25000000 0.33333 3.00000
102 8 8 25000000 0.33333 3.00000
102 13 503 25000000 0.33333 3.00000
102 15 8 25000000 0.33333 3.00000
102 16 503 25000000 0.33333 3.00000
102 16 8 25000000 0.33333 3.00000
102 17 503 25000000 0.33333 3.00000
102 18 503 25000000 0.33333 3.00000
102 19 503 25000000 0.33333 3.00000
102 19 8 25000000 0.33333 3.00000
102 23 503 25000000 0.33333 3.00000
102 25 503 25000000 0.33333 3.00000
102 31 503 25000000 0.33333 3.00000
102 32 503 25000000 0.33333 3.00000
102 32 8 25000000 0.33333 3.00000
102 33 503 25000000 0.33333 3.00000
102 35 503 25000000 0.33333 3.00000
102 35 8 25000000 0.33333 3.00000
102 36 503 25000000 0.33333 3.00000
102 39 503 25000000 0.33333 3.00000
102 44 503 25000000 0.33333 3.00000
102 46 503 25000000 0.33333 3.00000
102 46 8 25000000 0.33333 3.00000
102 47 503 25000000 0.33333 3.00000
102 48 503 25000000 0.33333 3.00000
102 49 8 25000000 0.33333 3.00000
Note that for some replicates both lenders in Package 102 are selected, for others, only one lender is selected and some replicates (24 of them) are empty, for an average selection probability of one third.
PG
Sorry for the delay. Thank you Reeza, stat@sas and PG Stats.
PG Stats's response is pretty near of what I want.
More details:
I have approximatly 125 000 packageIDs so fifty replicates are huge for SAS to run.
More importantly, for the purpose of my further codes (which I have a good idea), my final table must only contain LenderID's numbers chosen randomly without any replicate. I don't know if it is possible.For example, if my samprate rate is 0,333333, can a code take all LenderIDs if a packageID has one or two LenderIDs (?) and if a packageID contains three or more LenderIDs, is it possible that the code rounds to the near number to select? (In this way, my code would choose one LenderID for a packageID of three LenderIDs, one LenderID for a packageID of four LenderIDs (round(1,33333)=1), two LenderIDs for a packageID of five LenderIDS (round(1,666666)=2) and so on.
Thank you so much for your help. When you start programming, it is hard to figure out what kind of codes work. I appreciate your guidance.
With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do. - PG
Thank you so much.
Hello,
Sorry for asking again.
1) The resulting table excludes some packageID (91 152 packageIDs at the end VS 124 001 packageIDs at the start).
2) The number of lenders chosen randomly is also not exact. For example, if a packageID has 28 lenders, the resulting table contains 11 lenders when it should have 9 lenders. It's just one example. It seems that the number of lenders is overestimated in general.
Do you have any idea what can cause these inaccuracies?
Thank you in advance.
What is the total number of lenders before and after sampling?
Before the
sampling : 2065 distinct LenderIDs
After the
sampling : 2027 distinct LenderIDs
I wouldn’t be surprised if the number of lenderIDs after the sampling was 2055 for
example. Some lenders participate occasionally (under 10 packageID). 2027 is still possible.
Under method=BERNOUILLI the sample size in each stratum is a random variable which gives the requested sampling rate on average. Depending on the purpose of your sample, you may prefer method=SRS (for Simple Random Sampling) where the effective sample size is MAX(1, ROUND(N*samprate)), where N is the stratum size. Under SRS, strata with a single observation will always be sampled, strata with two observations will always have one and only one observation selected, and so on. If you have a large proportion of your strata with one or two observations, and are requesting a sampling rate lower than 50%, the effective sampling rate for these strata (100 and 50%) will necessarily inflate the overall sampling rate. In other words, small strata will be overrepresented in your sample. But you may prefer that over having random sample sizes and missing strata in your sample.
PG
Thank you so much!
Under SRS, my problems disappear:
1) I keep 124 001 packageIDs before and after the sampling.
2) The numbers of lenderID seem to be right.
One detail: If I want 28 x 0,33333= 9,33333, it keeps 10 lenderIDs. Maybe I would prefer that it keeps 9 lenderIDs but I can live with that.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.