Solved: Re: Random selection in a table

windyboo · Posted 05-09-2014 07:53 PM

Hello,

I have a table that looks like that:

PackageID	LenderID	DealAmount	…
100	8	100000000	…
100	103	100000000	…
100	500	100000000	…
100	493	100000000	…
100	359	100000000	…
100	245	100000000	…
100	444	100000000	…
100	1234	100000000	…
101	2	254000000	…
101	8	254000000	…
101	103	254000000	…
101	87	254000000	…
101	237	254000000	…
102	503	25000000	…
102	8	25000000	…
…	…	…	…

I have a PackageID which is a number associated to a loan.

I have many lenders for a PackageID (LenderID is the identification number associated to a lender). The number of lenders by PackageID may vary between two and 100 approximately.

I want to test some hypothesis. For this purpose, I must select randomly half of the lenders by PackageID, the third of the lenders by PackageID, a quarter of the lenders by PackageID and so on.

What should I do? Most of the time, if I ask to select the third of the lenders (for example), it won't result in an integer for the number of lenders. How does I treat that in a code?

Thank you in advance!

PGStats · Posted 05-17-2014 04:47 PM

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do. - PG

PG

View solution in original post

Reeza · Posted 05-09-2014 08:11 PM

Try looking into PROC SURVEYSELECT for starters.

stat_sas · Posted 05-09-2014 08:39 PM

Try this one. You can vary "samprate" to get 25%, 50% and 75% samples stratified by PackageID

proc surveyselect data = have out = want method = srs samprate = .5 seed = 1230;

PackageID;

run;

PGStats · Posted 05-09-2014 10:31 PM

Building on suggestion, you may circumvent the small strata sizes problem with replicated sampling. The idea is to repeat sample selection so that on average you get the proportion you want. For example :

proc surveyselect data=have out=want method=BERNOULLI samprate=0.333333 seed=8757587 rep=50;

strata PackageID;

run;

proc print data=want noobs; where PackageID=102; run;

Package Lender Deal Selection Sampling

ID Replicate ID Amount Prob Weight

102 1 8 25000000 0.33333 3.00000

102 3 503 25000000 0.33333 3.00000

102 3 8 25000000 0.33333 3.00000

102 5 503 25000000 0.33333 3.00000

102 5 8 25000000 0.33333 3.00000

102 6 8 25000000 0.33333 3.00000

102 8 8 25000000 0.33333 3.00000

102 13 503 25000000 0.33333 3.00000

102 15 8 25000000 0.33333 3.00000

102 16 503 25000000 0.33333 3.00000

102 16 8 25000000 0.33333 3.00000

102 17 503 25000000 0.33333 3.00000

102 18 503 25000000 0.33333 3.00000

102 19 503 25000000 0.33333 3.00000

102 19 8 25000000 0.33333 3.00000

102 23 503 25000000 0.33333 3.00000

102 25 503 25000000 0.33333 3.00000

102 31 503 25000000 0.33333 3.00000

102 32 503 25000000 0.33333 3.00000

102 32 8 25000000 0.33333 3.00000

102 33 503 25000000 0.33333 3.00000

102 35 503 25000000 0.33333 3.00000

102 35 8 25000000 0.33333 3.00000

102 36 503 25000000 0.33333 3.00000

102 39 503 25000000 0.33333 3.00000

102 44 503 25000000 0.33333 3.00000

102 46 503 25000000 0.33333 3.00000

102 46 8 25000000 0.33333 3.00000

102 47 503 25000000 0.33333 3.00000

102 48 503 25000000 0.33333 3.00000

102 49 8 25000000 0.33333 3.00000

Note that for some replicates both lenders in Package 102 are selected, for others, only one lender is selected and some replicates (24 of them) are empty, for an average selection probability of one third.

PG

windyboo · Posted 05-17-2014 01:43 PM

Sorry for the delay. Thank you Reeza, stat@sas and PG Stats.

PG Stats's response is pretty near of what I want.

More details:

I have approximatly 125 000 packageIDs so fifty replicates are huge for SAS to run.

More importantly, for the purpose of my further codes (which I have a good idea), my final table must only contain LenderID's numbers chosen randomly without any replicate. I don't know if it is possible.For example, if my samprate rate is 0,333333, can a code take all LenderIDs if a packageID has one or two LenderIDs (?) and if a packageID contains three or more LenderIDs, is it possible that the code rounds to the near number to select? (In this way, my code would choose one LenderID for a packageID of three LenderIDs, one LenderID for a packageID of four LenderIDs (round(1,33333)=1), two LenderIDs for a packageID of five LenderIDS (round(1,666666)=2) and so on.

Thank you so much for your help. When you start programming, it is hard to figure out what kind of codes work. I appreciate your guidance.

PGStats · Posted 05-17-2014 04:47 PM

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do. - PG

PG

windyboo · Posted 05-17-2014 06:51 PM

Thank you so much.

windyboo · Posted 05-17-2014 09:36 PM

Hello,

Sorry for asking again.

1) The resulting table excludes some packageID (91 152 packageIDs at the end VS 124 001 packageIDs at the start).

2) The number of lenders chosen randomly is also not exact. For example, if a packageID has 28 lenders, the resulting table contains 11 lenders when it should have 9 lenders. It's just one example. It seems that the number of lenders is overestimated in general.

Do you have any idea what can cause these inaccuracies?

Thank you in advance.

PGStats · Posted 05-17-2014 10:14 PM

What is the total number of lenders before and after sampling?

PG

windyboo · Posted 05-19-2014 12:59 PM

Before the
sampling : 2065 distinct LenderIDs

After the
sampling : 2027 distinct LenderIDs

I wouldn’t be surprised if the number of lenderIDs after the sampling was 2055 for
example. Some lenders participate occasionally (under 10 packageID). 2027 is still possible.

PGStats · Posted 05-17-2014 11:51 PM

Under method=BERNOUILLI the sample size in each stratum is a random variable which gives the requested sampling rate on average. Depending on the purpose of your sample, you may prefer method=SRS (for Simple Random Sampling) where the effective sample size is MAX(1, ROUND(N*samprate)), where N is the stratum size. Under SRS, strata with a single observation will always be sampled, strata with two observations will always have one and only one observation selected, and so on. If you have a large proportion of your strata with one or two observations, and are requesting a sampling rate lower than 50%, the effective sampling rate for these strata (100 and 50%) will necessarily inflate the overall sampling rate. In other words, small strata will be overrepresented in your sample. But you may prefer that over having random sample sizes and missing strata in your sample.

PG

windyboo · Posted 05-19-2014 11:02 AM

Thank you so much!

Under SRS, my problems disappear:

1) I keep 124 001 packageIDs before and after the sampling.

2) The numbers of lenderID seem to be right.

One detail: If I want 28 x 0,33333= 9,33333, it keeps 10 lenderIDs. Maybe I would prefer that it keeps 9 lenderIDs but I can live with that.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away