Splitting data set on specific criteria

Reply
Occasional Contributor
Posts: 9

Splitting data set on specific criteria

Hey everyone,

I have a data set where I am required to select french speaking customers and place them in Bucket 1. I am then required split the remainer of the customers between Bucket 1 and Bucket 2 so that both buckets have the same number of customers.

E.g.

Total Customers: 1000

     French spk: 200 (designated to Bucket 1)

Therefore, Bucket 1 gets 300 and Buckt 2 gets 500 of the remaining customers.

Thanks for all the help.

Super User
Posts: 10,550

Re: Splitting data set on specific criteria

Do the remaining customers have to be selected randomly?

Are you supposed to have 2 datasets as a result or add a variable?

If randomness isn't a serious concern sort the data so that the French speakers are first.

To add a bucket variable:

data want;

     set sorted;

     if French then bucket=1;

     else do;

          count+1;

          if count le 100 then bucket = 1;

          else bucket = 2;

     end;

     drop count;

run;

if two data sets

data bucket1 bucket2;

     set sorted;

     if French then output bucket1;

     else do;

          count+1;

          if count le 100 then output bucket1;

          else output bucket2;

     end;

     drop count;

run;

Occasional Contributor
Posts: 9

Re: Splitting data set on specific criteria

Q1: Sorry, I should have mentioned that the remaining customers do need to be selected randomly.

Q2: I should add a variable rather than have 2 separate data sets.

I'm new to coding but I follow what you've done so far - how would it change by conducting random selection of the remaining customers?

Thanks again!

Super User
Posts: 10,550

Re: Splitting data set on specific criteria

proc surveyselect data=have (where=(French='NO')) /* what ever would be needed to say none of the French speaking records*/

    out=want sampsize=500 outall;

run;

Will indicate 300 (sampsize) records as selected (a variable named Selected is added and values of 1 mean in that group). The remaining will be Selected=0;

Then add  the French back:

Data finalwant;

     set want

          have (where=(French='YES') in=infrench)

     ;

     if infrench then selected=0;

run;

the Selected =0 would be the French + bucket, 1 the other. You could rename Selected to Bucket if desired.

Occasional Contributor
Posts: 9

Re: Splitting data set on specific criteria

Thanks for the help!

I ended up using a hybrid of the stuff I found here to get my answer. It's probably not as efficient as what you guys did but I know I can at least explain what I'm doing.

Cheers!

Respected Advisor
Posts: 4,663

Re: Splitting data set on specific criteria

You could try something like this:

data temp;

call streaminit(167353);

set myCustomers;

if French then rnd = -1;

else rnd = rand("UNIFORM");

run;

proc sort data=temp; by rnd; run;

%let totalCust=1000;

data want;

set temp;

if 2*_n_ <= &totalCust. then do;

  bucket = 1;

  output;

  end;

else if _n_ <=  &totalCust. then do;

  bucket = 2;

  output;

  end;

drop rnd;

run;

(untested)

PG

Message was edited by: PG

PG
Ask a Question
Discussion stats
  • 5 replies
  • 300 views
  • 6 likes
  • 3 in conversation