BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
windyboo
Calcite | Level 5

Hello,

I have a table that looks like that:

PackageIDLenderIDDealAmount
1008100000000
100103100000000
100500100000000
100493100000000
100359100000000
100245100000000
100444100000000
1001234100000000
1012254000000
1018254000000
101103254000000
10187254000000
101237254000000
10250325000000
102825000000

I have a PackageID which is a number associated to a loan.

I have many lenders for a PackageID (LenderID is the identification number associated to a lender). The number of lenders by PackageID may vary between two and 100 approximately.

I want to test some hypothesis. For this purpose, I must select randomly half of the lenders by PackageID, the third of the lenders by PackageID, a quarter of the lenders by PackageID and so on.

What should I do? Most of the time, if I ask to select the third of the lenders (for example), it won't result in an integer for the number of lenders. How does I treat that in a code?

Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do.  - PG

PG

View solution in original post

11 REPLIES 11
Reeza
Super User

Try looking into PROC SURVEYSELECT for starters.

stat_sas
Ammonite | Level 13

Try this one. You can vary "samprate" to get 25%, 50% and 75% samples stratified by PackageID

proc surveyselect data = have out = want method = srs samprate = .5 seed = 1230;

PackageID;

run;

PGStats
Opal | Level 21

Building on suggestion, you may circumvent the small strata sizes problem with replicated sampling. The idea is to repeat sample selection so that on average you get the proportion you want. For example :

proc surveyselect data=have out=want method=BERNOULLI samprate=0.333333 seed=8757587 rep=50;

strata PackageID;

run;

proc print data=want noobs; where PackageID=102; run;

       Package                 Lender      Deal      Selection    Sampling

           ID      Replicate      ID       Amount        Prob       Weight

          102           1           8     25000000     0.33333      3.00000

          102           3         503     25000000     0.33333      3.00000

          102           3           8     25000000     0.33333      3.00000

          102           5         503     25000000     0.33333      3.00000

          102           5           8     25000000     0.33333      3.00000

          102           6           8     25000000     0.33333      3.00000

          102           8           8     25000000     0.33333      3.00000

          102          13         503     25000000     0.33333      3.00000

          102          15           8     25000000     0.33333      3.00000

          102          16         503     25000000     0.33333      3.00000

          102          16           8     25000000     0.33333      3.00000

          102          17         503     25000000     0.33333      3.00000

          102          18         503     25000000     0.33333      3.00000

          102          19         503     25000000     0.33333      3.00000

          102          19           8     25000000     0.33333      3.00000

          102          23         503     25000000     0.33333      3.00000

          102          25         503     25000000     0.33333      3.00000

          102          31         503     25000000     0.33333      3.00000

          102          32         503     25000000     0.33333      3.00000

          102          32           8     25000000     0.33333      3.00000

          102          33         503     25000000     0.33333      3.00000

          102          35         503     25000000     0.33333      3.00000

          102          35           8     25000000     0.33333      3.00000

          102          36         503     25000000     0.33333      3.00000

          102          39         503     25000000     0.33333      3.00000

          102          44         503     25000000     0.33333      3.00000

          102          46         503     25000000     0.33333      3.00000

          102          46           8     25000000     0.33333      3.00000

          102          47         503     25000000     0.33333      3.00000

          102          48         503     25000000     0.33333      3.00000

          102          49           8     25000000     0.33333      3.00000

Note that for some replicates both lenders in Package 102 are selected, for others, only one lender is selected and some replicates (24 of them) are empty, for an average selection probability of one third.

PG

PG
windyboo
Calcite | Level 5

Sorry for the delay. Thank you Reeza, stat@sas and PG Stats.

PG Stats's response is pretty near of what I want.

More details:

I have approximatly 125 000 packageIDs so fifty replicates are huge for SAS to run.

More importantly, for the purpose of my further codes (which I have a good idea), my final table must only contain LenderID's numbers chosen randomly without any replicate. I don't know if it is possible.For example, if my samprate rate is 0,333333, can a code take all LenderIDs if a packageID has one or two LenderIDs (?) and if a packageID contains three or more LenderIDs, is it possible that the code rounds to the near number to select? (In this way, my code would choose one LenderID for a packageID of three LenderIDs, one LenderID for a packageID of four LenderIDs (round(1,33333)=1), two LenderIDs for a packageID of five LenderIDS (round(1,666666)=2) and so on.

Thank you so much for your help. When you start programming, it is hard to figure out what kind of codes work. I appreciate your guidance.

PGStats
Opal | Level 21

With that many PackageIDs, simply take 1 replicate (with method=BERNOUILLI) and you will get close to the required sampling rate, overall. That's what the SURVEYSELECT procedure is meant to do.  - PG

PG
windyboo
Calcite | Level 5

Thank you so much.

windyboo
Calcite | Level 5

Hello,

Sorry for asking again.

1) The resulting table excludes some packageID (91 152 packageIDs at the end VS 124 001 packageIDs at the start).

2) The number of lenders chosen randomly is also not exact. For example, if a packageID has 28 lenders, the resulting table contains 11 lenders when it should have 9 lenders. It's just one example. It seems that the number of lenders is overestimated in general.

Do you have any idea what can cause these inaccuracies?

Thank you in advance.

PGStats
Opal | Level 21

What is the total number of lenders before and after sampling?

PG
windyboo
Calcite | Level 5

Before the
sampling : 2065 distinct LenderIDs

After the
sampling : 2027 distinct LenderIDs

I wouldn’t be surprised if the number of lenderIDs after the sampling was 2055 for
example. Some lenders participate occasionally (under 10 packageID).  2027 is still possible.

PGStats
Opal | Level 21

Under method=BERNOUILLI the sample size in each stratum is a random variable which gives the requested sampling rate on average. Depending on the purpose of your sample, you may prefer method=SRS (for Simple Random Sampling) where the effective sample size is MAX(1, ROUND(N*samprate)), where N is the stratum size. Under SRS, strata with a single observation will always be sampled, strata with two observations will always have one and only one observation selected, and so on. If you have a large proportion of your strata with one or two observations, and are requesting a sampling rate lower than 50%, the effective sampling rate for these strata (100 and 50%) will necessarily inflate the overall sampling rate. In other words, small strata will be overrepresented in your sample. But you may prefer that over having random sample sizes and missing strata in your sample.

PG

PG
windyboo
Calcite | Level 5

Thank you so much!

Under SRS, my problems disappear:

1) I keep 124 001 packageIDs before and after the sampling.

2) The numbers of lenderID seem to be right.

One detail: If I want 28 x 0,33333= 9,33333, it keeps 10 lenderIDs. Maybe I would prefer that it keeps 9 lenderIDs but I can live with that.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 3470 views
  • 8 likes
  • 4 in conversation