DATA Step, Macro, Functions and more

Random distribution of records

Reply
Contributor
Posts: 71

Random distribution of records

Hi All,

I'm kind off stuck in a middle of a code and need some help. I have a population where it needs to be split 60/40. Then the 40% population should be evenly distributed(50/50) based on Rank variable. Any left over records after 50/50 split should be deleted.

So for example I have

27 total records

14 Records- segA

13 Records- SegB

The SegB population has total 13 records out of which 10 records with Rank = 'Y' and 3 records with Rank = 'N'. Now the 13 records should be evenly split based on Rank flag. Meaning Randomly select 3 accounts from Rank = 'Y' and delete the remaining 7 left over records.

So if i have 13 records in the segB population

10 - Y's

3 - N's

I would expect to see 6 total records

3 Y's(remove 7 Y records randomly)

3 N's

So my total final volume would be 20(14+6) records.

Sample data set

Data vendor_seg;

Infile cards;

Input ID $ 1-4 Rank $ 6;

cards;

1771 Y

3456 Y

5134 N

9087 Y

2829 Y

2749 Y

9987 Y

3094 Y

2938 N

8320 Y

5681 Y

2922 N

0987 Y

1108 N

3879 N

8734 N

1930 N

4752 Y

5609 N

7541 Y

8745 Y

9109 N

5904 Y

7145 Y

0876 N

2939 Y

7340 Y

;

Run;

Data vendor_seg2;

Set vendor_seg;

ranspt = ranuni(1239);

if ranspt < 0.6 then popul = 'SegA';else popu = 'SegB';

Run;

Now the catch here is sometimes Rank = 'N' count can be higher in segB population. So I have to take the lowest Rank record count and adjust the other rank volume accordingly to get a even 50/50 count.

I tried to do it in multiple steps by counting no of records in Rank = N population and using _n_ to delete the Y records from segB but it wasn't working. I am sure there is a best way to do this and you experts can get this all done in a single data step!.

Thanks in advance  for your help!!

Respected Advisor
Posts: 4,662

Re: Random distribution of records

Compute the required sampling rates in each strata (Y and N), use surveyselect to draw the sample and use a WHERE clause to drop the strata=N from the selected sample:

  
data vendor_seg;
Input ID $ 1-4 Rank $ 6;
datalines;
1771 Y
3456 Y
5134 N
9087 Y
2829 Y
2749 Y
9987 Y
3094 Y
2938 N
8320 Y
5681 Y
2922 N
0987 Y
1108 N
3879 N
8734 N
1930 N
4752 Y
5609 N
7541 Y
8745 Y
9109 N
5904 Y
7145 Y
0876 N
2939 Y
7340 Y
;

proc sort data=vendor_seg; by rank; run;

proc sql;
create table strataRate as
select

     rank,

     0.2*sum(n)/n as _rate_
from

     (select rank, count(*) as n from vendor_seg group by rank);

proc surveyselect data=vendor_seg seed=1239
     rate=strataRate out=sample(where=(rank="Y"));
strata rank;
run;

PG

PG
Ask a Question
Discussion stats
  • 1 reply
  • 278 views
  • 0 likes
  • 2 in conversation