Random selection in a table

Accepted Solution Solved
Reply
Contributor
Posts: 62
Accepted Solution

Random selection in a table

Hello,

I have a table that looks like that:

PackageIDLenderIDDealAmount
1008100000000
100103100000000
100500100000000
100493100000000
100359100000000
100245100000000
100444100000000
1001234100000000
1012254000000
1018254000000
101103254000000
10187254000000
101237254000000
10250325000000
102825000000

I have a PackageID which is a number associated to a loan.

I have many lenders for a PackageID (LenderID is the identification number associated to a lender). The number of lenders by PackageID may vary between two and 100 approximately.

I want to test some hypothesis. For this purpose, I must select randomly half of the lenders by PackageID, the third of the lenders by PackageID, a quarter of the lenders by PackageID and so on.

What should I do? Most of the time, if I ask to select the third of the lenders (for example), it won't result in an integer for the number of lenders. How does I treat that in a code?

Thank you in advance!


Accepted Solutions
Solution
‎05-17-2014 04:47 PM
Respected Advisor
Posts: 4,919

Re: Random selection in a table

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do.  - PG

PG

View solution in original post


All Replies
Super User
Posts: 19,769

Re: Random selection in a table

Try looking into PROC SURVEYSELECT for starters.

Trusted Advisor
Posts: 1,228

Re: Random selection in a table

Try this one. You can vary "samprate" to get 25%, 50% and 75% samples stratified by PackageID

proc surveyselect data = have out = want method = srs samprate = .5 seed = 1230;

PackageID;

run;

Respected Advisor
Posts: 4,919

Re: Random selection in a table

Building on suggestion, you may circumvent the small strata sizes problem with replicated sampling. The idea is to repeat sample selection so that on average you get the proportion you want. For example :

proc surveyselect data=have out=want method=BERNOULLI samprate=0.333333 seed=8757587 rep=50;

strata PackageID;

run;

proc print data=want noobs; where PackageID=102; run;

       Package                 Lender      Deal      Selection    Sampling

           ID      Replicate      ID       Amount        Prob       Weight

          102           1           8     25000000     0.33333      3.00000

          102           3         503     25000000     0.33333      3.00000

          102           3           8     25000000     0.33333      3.00000

          102           5         503     25000000     0.33333      3.00000

          102           5           8     25000000     0.33333      3.00000

          102           6           8     25000000     0.33333      3.00000

          102           8           8     25000000     0.33333      3.00000

          102          13         503     25000000     0.33333      3.00000

          102          15           8     25000000     0.33333      3.00000

          102          16         503     25000000     0.33333      3.00000

          102          16           8     25000000     0.33333      3.00000

          102          17         503     25000000     0.33333      3.00000

          102          18         503     25000000     0.33333      3.00000

          102          19         503     25000000     0.33333      3.00000

          102          19           8     25000000     0.33333      3.00000

          102          23         503     25000000     0.33333      3.00000

          102          25         503     25000000     0.33333      3.00000

          102          31         503     25000000     0.33333      3.00000

          102          32         503     25000000     0.33333      3.00000

          102          32           8     25000000     0.33333      3.00000

          102          33         503     25000000     0.33333      3.00000

          102          35         503     25000000     0.33333      3.00000

          102          35           8     25000000     0.33333      3.00000

          102          36         503     25000000     0.33333      3.00000

          102          39         503     25000000     0.33333      3.00000

          102          44         503     25000000     0.33333      3.00000

          102          46         503     25000000     0.33333      3.00000

          102          46           8     25000000     0.33333      3.00000

          102          47         503     25000000     0.33333      3.00000

          102          48         503     25000000     0.33333      3.00000

          102          49           8     25000000     0.33333      3.00000

Note that for some replicates both lenders in Package 102 are selected, for others, only one lender is selected and some replicates (24 of them) are empty, for an average selection probability of one third.

PG

PG
Contributor
Posts: 62

Re: Random selection in a table

Sorry for the delay. Thank you Reeza, stat@sas and PG Stats.

PG Stats's response is pretty near of what I want.

More details:

I have approximatly 125 000 packageIDs so fifty replicates are huge for SAS to run.

More importantly, for the purpose of my further codes (which I have a good idea), my final table must only contain LenderID's numbers chosen randomly without any replicate. I don't know if it is possible.For example, if my samprate rate is 0,333333, can a code take all LenderIDs if a packageID has one or two LenderIDs (?) and if a packageID contains three or more LenderIDs, is it possible that the code rounds to the near number to select? (In this way, my code would choose one LenderID for a packageID of three LenderIDs, one LenderID for a packageID of four LenderIDs (round(1,33333)=1), two LenderIDs for a packageID of five LenderIDS (round(1,666666)=2) and so on.

Thank you so much for your help. When you start programming, it is hard to figure out what kind of codes work. I appreciate your guidance.

Solution
‎05-17-2014 04:47 PM
Respected Advisor
Posts: 4,919

Re: Random selection in a table

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do.  - PG

PG
Contributor
Posts: 62

Re: Random selection in a table

Thank you so much.

Contributor
Posts: 62

Re: Random selection in a table

Hello,

Sorry for asking again.

1) The resulting table excludes some packageID (91 152 packageIDs at the end VS 124 001 packageIDs at the start).

2) The number of lenders chosen randomly is also not exact. For example, if a packageID has 28 lenders, the resulting table contains 11 lenders when it should have 9 lenders. It's just one example. It seems that the number of lenders is overestimated in general.

Do you have any idea what can cause these inaccuracies?

Thank you in advance.

Respected Advisor
Posts: 4,919

Re: Random selection in a table

What is the total number of lenders before and after sampling?

PG
Contributor
Posts: 62

Re: Random selection in a table

Before the
sampling : 2065 distinct LenderIDs

After the
sampling : 2027 distinct LenderIDs

I wouldn’t be surprised if the number of lenderIDs after the sampling was 2055 for
example. Some lenders participate occasionally (under 10 packageID).  2027 is still possible.

Respected Advisor
Posts: 4,919

Re: Random selection in a table

Under method=BERNOUILLI the sample size in each stratum is a random variable which gives the requested sampling rate on average. Depending on the purpose of your sample, you may prefer method=SRS (for Simple Random Sampling) where the effective sample size is MAX(1, ROUND(N*samprate)), where N is the stratum size. Under SRS, strata with a single observation will always be sampled, strata with two observations will always have one and only one observation selected, and so on. If you have a large proportion of your strata with one or two observations, and are requesting a sampling rate lower than 50%, the effective sampling rate for these strata (100 and 50%) will necessarily inflate the overall sampling rate. In other words, small strata will be overrepresented in your sample. But you may prefer that over having random sample sizes and missing strata in your sample.

PG

PG
Contributor
Posts: 62

Re: Random selection in a table

Thank you so much!

Under SRS, my problems disappear:

1) I keep 124 001 packageIDs before and after the sampling.

2) The numbers of lenderID seem to be right.

One detail: If I want 28 x 0,33333= 9,33333, it keeps 10 lenderIDs. Maybe I would prefer that it keeps 9 lenderIDs but I can live with that.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 336 views
  • 8 likes
  • 4 in conversation