Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- Random distribution of records

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-05-2013 07:26 PM

Hi All,

I'm kind off stuck in a middle of a code and need some help. I have a population where it needs to be split 60/40. Then the 40% population should be evenly distributed(50/50) based on Rank variable. Any left over records after 50/50 split should be deleted.

So for example I have

27 total records

14 Records- segA

13 Records- SegB

The SegB population has total 13 records out of which 10 records with Rank = 'Y' and 3 records with Rank = 'N'. Now the 13 records should be evenly split based on Rank flag. Meaning Randomly select 3 accounts from Rank = 'Y' and delete the remaining 7 left over records.

So if i have 13 records in the segB population

10 - Y's

3 - N's

I would expect to see 6 total records

3 Y's(remove 7 Y records randomly)

3 N's

So my total final volume would be 20(14+6) records.

Sample data set

Data vendor_seg;

Infile cards;

Input ID $ 1-4 Rank $ 6;

cards;

1771 Y

3456 Y

5134 N

9087 Y

2829 Y

2749 Y

9987 Y

3094 Y

2938 N

8320 Y

5681 Y

2922 N

0987 Y

1108 N

3879 N

8734 N

1930 N

4752 Y

5609 N

7541 Y

8745 Y

9109 N

5904 Y

7145 Y

0876 N

2939 Y

7340 Y

;

Run;

Data vendor_seg2;

Set vendor_seg;

ranspt = ranuni(1239);

if ranspt < 0.6 then popul = 'SegA';else popu = 'SegB';

Run;

Now the catch here is sometimes Rank = 'N' count can be higher in segB population. So I have to take the lowest Rank record count and adjust the other rank volume accordingly to get a even 50/50 count.

I tried to do it in multiple steps by counting no of records in Rank = N population and using _n_ to delete the Y records from segB but it wasn't working. I am sure there is a best way to do this and you experts can get this all done in a single data step!.

Thanks in advance for your help!!

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to vicky07

06-05-2013 11:01 PM

Compute the required sampling rates in each strata (Y and N), use surveyselect to draw the sample and use a WHERE clause to drop the strata=N from the selected sample:

**data vendor_seg;****Input ID $ 1-4 Rank $ 6;****datalines;****1771 Y****3456 Y****5134 N****9087 Y****2829 Y****2749 Y****9987 Y****3094 Y****2938 N****8320 Y****5681 Y****2922 N****0987 Y****1108 N****3879 N****8734 N****1930 N****4752 Y****5609 N****7541 Y****8745 Y****9109 N****5904 Y****7145 Y****0876 N****2939 Y****7340 Y****;**

**proc sort data=vendor_seg; by rank; run;**

**proc sql;****create table strataRate as****select **

** rank, **

** 0.2*sum(n)/n as _rate_****from**

** (select rank, count(*) as n from vendor_seg group by rank);**

**proc surveyselect data=vendor_seg seed=1239**** rate=strataRate out=sample(where=(rank="Y"));****strata rank;****run;**

PG

PG