Calcite | Level 5

## Random selection in a table

Hello,

I have a table that looks like that:

 PackageID LenderID DealAmount … 100 8 100000000 … 100 103 100000000 … 100 500 100000000 … 100 493 100000000 … 100 359 100000000 … 100 245 100000000 … 100 444 100000000 … 100 1234 100000000 … 101 2 254000000 … 101 8 254000000 … 101 103 254000000 … 101 87 254000000 … 101 237 254000000 … 102 503 25000000 … 102 8 25000000 … … … … …

I have a PackageID which is a number associated to a loan.

I have many lenders for a PackageID (LenderID is the identification number associated to a lender). The number of lenders by PackageID may vary between two and 100 approximately.

I want to test some hypothesis. For this purpose, I must select randomly half of the lenders by PackageID, the third of the lenders by PackageID, a quarter of the lenders by PackageID and so on.

What should I do? Most of the time, if I ask to select the third of the lenders (for example), it won't result in an integer for the number of lenders. How does I treat that in a code?

1 ACCEPTED SOLUTION

Accepted Solutions
Opal | Level 21

## Re: Random selection in a table

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do.  - PG

PG
11 REPLIES 11
Super User

## Re: Random selection in a table

Try looking into PROC SURVEYSELECT for starters.

Ammonite | Level 13

## Re: Random selection in a table

Try this one. You can vary "samprate" to get 25%, 50% and 75% samples stratified by PackageID

proc surveyselect data = have out = want method = srs samprate = .5 seed = 1230;

PackageID;

run;

Opal | Level 21

## Re: Random selection in a table

Building on suggestion, you may circumvent the small strata sizes problem with replicated sampling. The idea is to repeat sample selection so that on average you get the proportion you want. For example :

proc surveyselect data=have out=want method=BERNOULLI samprate=0.333333 seed=8757587 rep=50;

strata PackageID;

run;

proc print data=want noobs; where PackageID=102; run;

Package                 Lender      Deal      Selection    Sampling

ID      Replicate      ID       Amount        Prob       Weight

102           1           8     25000000     0.33333      3.00000

102           3         503     25000000     0.33333      3.00000

102           3           8     25000000     0.33333      3.00000

102           5         503     25000000     0.33333      3.00000

102           5           8     25000000     0.33333      3.00000

102           6           8     25000000     0.33333      3.00000

102           8           8     25000000     0.33333      3.00000

102          13         503     25000000     0.33333      3.00000

102          15           8     25000000     0.33333      3.00000

102          16         503     25000000     0.33333      3.00000

102          16           8     25000000     0.33333      3.00000

102          17         503     25000000     0.33333      3.00000

102          18         503     25000000     0.33333      3.00000

102          19         503     25000000     0.33333      3.00000

102          19           8     25000000     0.33333      3.00000

102          23         503     25000000     0.33333      3.00000

102          25         503     25000000     0.33333      3.00000

102          31         503     25000000     0.33333      3.00000

102          32         503     25000000     0.33333      3.00000

102          32           8     25000000     0.33333      3.00000

102          33         503     25000000     0.33333      3.00000

102          35         503     25000000     0.33333      3.00000

102          35           8     25000000     0.33333      3.00000

102          36         503     25000000     0.33333      3.00000

102          39         503     25000000     0.33333      3.00000

102          44         503     25000000     0.33333      3.00000

102          46         503     25000000     0.33333      3.00000

102          46           8     25000000     0.33333      3.00000

102          47         503     25000000     0.33333      3.00000

102          48         503     25000000     0.33333      3.00000

102          49           8     25000000     0.33333      3.00000

Note that for some replicates both lenders in Package 102 are selected, for others, only one lender is selected and some replicates (24 of them) are empty, for an average selection probability of one third.

PG

PG
Calcite | Level 5

## Re: Random selection in a table

Sorry for the delay. Thank you Reeza, stat@sas and PG Stats.

PG Stats's response is pretty near of what I want.

More details:

I have approximatly 125 000 packageIDs so fifty replicates are huge for SAS to run.

More importantly, for the purpose of my further codes (which I have a good idea), my final table must only contain LenderID's numbers chosen randomly without any replicate. I don't know if it is possible.For example, if my samprate rate is 0,333333, can a code take all LenderIDs if a packageID has one or two LenderIDs (?) and if a packageID contains three or more LenderIDs, is it possible that the code rounds to the near number to select? (In this way, my code would choose one LenderID for a packageID of three LenderIDs, one LenderID for a packageID of four LenderIDs (round(1,33333)=1), two LenderIDs for a packageID of five LenderIDS (round(1,666666)=2) and so on.

Thank you so much for your help. When you start programming, it is hard to figure out what kind of codes work. I appreciate your guidance.

Opal | Level 21

## Re: Random selection in a table

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do.  - PG

PG
Calcite | Level 5

## Re: Random selection in a table

Thank you so much.

Calcite | Level 5

## Re: Random selection in a table

Hello,

1) The resulting table excludes some packageID (91 152 packageIDs at the end VS 124 001 packageIDs at the start).

2) The number of lenders chosen randomly is also not exact. For example, if a packageID has 28 lenders, the resulting table contains 11 lenders when it should have 9 lenders. It's just one example. It seems that the number of lenders is overestimated in general.

Do you have any idea what can cause these inaccuracies?

Opal | Level 21

## Re: Random selection in a table

What is the total number of lenders before and after sampling?

PG
Calcite | Level 5

## Re: Random selection in a table

Before the
sampling : 2065 distinct LenderIDs

After the
sampling : 2027 distinct LenderIDs

I wouldn’t be surprised if the number of lenderIDs after the sampling was 2055 for
example. Some lenders participate occasionally (under 10 packageID).  2027 is still possible.

Opal | Level 21

## Re: Random selection in a table

Under method=BERNOUILLI the sample size in each stratum is a random variable which gives the requested sampling rate on average. Depending on the purpose of your sample, you may prefer method=SRS (for Simple Random Sampling) where the effective sample size is MAX(1, ROUND(N*samprate)), where N is the stratum size. Under SRS, strata with a single observation will always be sampled, strata with two observations will always have one and only one observation selected, and so on. If you have a large proportion of your strata with one or two observations, and are requesting a sampling rate lower than 50%, the effective sampling rate for these strata (100 and 50%) will necessarily inflate the overall sampling rate. In other words, small strata will be overrepresented in your sample. But you may prefer that over having random sample sizes and missing strata in your sample.

PG

PG
Calcite | Level 5

## Re: Random selection in a table

Thank you so much!

Under SRS, my problems disappear:

1) I keep 124 001 packageIDs before and after the sampling.

2) The numbers of lenderID seem to be right.

One detail: If I want 28 x 0,33333= 9,33333, it keeps 10 lenderIDs. Maybe I would prefer that it keeps 9 lenderIDs but I can live with that.

Discussion stats
• 11 replies
• 1609 views
• 8 likes
• 4 in conversation