BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sidello
Calcite | Level 5

Hi; 

 

I have a sample of 100 participants, categorized in three categories (1,2,3). Within each category, pairs are made (based upon several characteristics). 

I now wan't to randomize all participants into two groups, taking into account the pair (one participant of each pair) and the categories (an equal number within each category). 

What is the best way to approach this? 

 

label part="part" cat="cat" pair="pair";
datalines;
1 1 1
2 3 2
3 2 3
4 2 4
5 3 5
6 3 6
7 3 7
8 1 8
9 1 9
10 1 10
11 1 1
12 1 2
13 2 3
14 3 4
15 2 5
16 2 6
17 3 7
18 2 8
19 2 9
20 2 1

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @sidello,

 

Let's assume your input dataset (created without the DSD option of the INFILE statement) was named HAVE. My understanding is that you want to split each pair so as to assign the two participants to two groups (say group 0 and group 1) which are as balanced as possible regarding the frequency distributions of categorical variable CAT. My suggested solution to this problem would look something like this:

/* Rearrange the data to have one observation per pair */

proc sql;
create view _tmp
as select * from have
order by pair, part;
quit;

data pairs(drop=part cat);
call streaminit(27182818);
do until(last.pair);
  set _tmp;
  by pair;
  if first.pair then do;
    _p0=part;
    _c0=cat;
  end;
  else do;
    _p1=part;
    _c1=cat;
  end;
end;
_r=rand('uniform');
run;

proc sql;
drop view _tmp;
quit;

/* Sort dataset PAIRS randomly */

proc sort data=pairs out=pairs;
by _r;
run;

/* Assign participants to two groups (0 and 1) with balanced
   numbers of participants per category */

%let nc=3; /* number of categories (1, 2, ..., &nc) */

data want(drop=_:);
call streaminit(3141592);
array _n[&nc,0:1] _temporary_ (%eval(2*&nc)*0);
/* _n[c, g] is the number of participants of category c in group g. */
set pairs;
/* Check how either of the two possible assignments affects the balance: */
_d0=(_n[_c0,0]+1-_n[_c0,1])**2+(_n[_c1,1]+1-_n[_c1,0])**2;
_d1=(_n[_c0,1]+1-_n[_c0,0])**2+(_n[_c1,0]+1-_n[_c1,1])**2;
part=_p0;
cat=_c0;
/* Prefer an assignment which improves balance (if any), otherwise assign randomly: */
if _d0<_d1 then group=0;
else if _d1<_d0 then group=1;
else group=rand('bern',0.5);
_n[cat,group]+1;
output;
part=_p1;
cat=_c1;
group=1-group;
_n[cat,group]+1;
output;
run;

proc sort data=want;
by part pair;
run;

/* Check if groups are balanced in each category */

proc freq data=want;
tables group*cat;
run;

View solution in original post

4 REPLIES 4
Spuzzz
Calcite | Level 5

Difficult to discern what you want your end result to be.  Your example data has 20 participants (I assume the last PAIR value is supposed to be 10, not 1).   You want to randomly assign them into 2 groups (based on the PAIR ID I think?).  And then further eliminate participants so you have an equal number of participants in each category/pair combination?  If I got that right, then here is one approach.  There are many different ways this could be solved and I'm sure mine is not the simplest. 

 

Using this small dataset of 20 you can play with the RANUNI seed value...sometimes you get a sample size of 2, other times a sample size of 1 depending upon how the random deck is shuffled.  A larger dataset will obviously yield higher sample sizes.  (Sorry the cutting and pasting of code didn't maintain my indents - not super readable)

 

DATA FULLSET;
INFILE CARDS;
label part="part" cat="cat" pair="pair";
INPUT PART CAT PAIR;
SORTVAR=RANUNI(13); *** THE SEED VALUE HERE IS ARBITRARY. CHANGING IT WILL SHUFFLE THE DECK DIFFERENTLY;
CARDS;
1 1 1
2 3 2
3 2 3
4 2 4
5 3 5
6 3 6
7 3 7
8 1 8
9 1 9
10 1 10
11 1 1
12 1 2
13 2 3
14 3 4
15 2 5
16 2 6
17 3 7
18 2 8
19 2 9
20 2 10
;
RUN;

 

PROC SORT DATA=FULLSET;
BY PAIR SORTVAR;
RUN;

 

**** RANDOMLY SEPARATE INTO 2 GROUPS BASED ON PAIR;

 

DATA FULLSET;
SET FULLSET;
BY PAIR;
IF FIRST.PAIR THEN PAIRGROUP=1;
ELSE PAIRGROUP=2;
RUN;

 

*** SUMMARIZE DATASET TO DETERMINE LOWEST FREQUENCY OF CATEGORIES FOUND;

PROC SUMMARY NWAY DATA=FULLSET;
CLASS PAIRGROUP CAT;
OUTPUT OUT=FULLSET_SUM;
RUN;

 

*** SORT SO THE LOWEST NUMBER OF PARTICPANTS IS ON TOP;
PROC SORT DATA=FULLSET_SUM;
BY _FREQ_;
RUN;

 

*** NOW EXTRACT THAT FIRST OBSERVATION AND WRITE OUT TO A MACRO VARIABLE...THIS WILL BE
THE NUMBER WE USE TO STRATIFY OUR SAMPLE;

DATA _NULL_;
SET FULLSET_SUM(OBS=1);
CALL SYMPUT("STRATSAMPSIZE",_FREQ_);
RUN;

%PUT USING A VALUE OF &STRATSAMPSIZE. FOR STRATIFIED SAMPLE SIZE;


**** RE-SORT FULLSET IN ORDER OF PAIRGROUP AND CATEGORY;
PROC SORT DATA=FULLSET;
BY PAIRGROUP CAT SORTVAR;
RUN;

DATA SAMPLE;
SET FULLSET;
BY PAIRGROUP CAT;
SAMPCOUNT+1;
IF FIRST.PAIRGROUP OR FIRST.CAT THEN SAMPCOUNT=1;
IF SAMPCOUNT LE &STRATSAMPSIZE.;
DROP SORTVAR SAMPCOUNT;
RUN;


*** VERIFIY SAMPLE SIZES ARE WHAT WE WANT - EQUAL ACROSS ALL CATEGORIES/PAIR COMBINATIONS;
PROC FREQ DATA=SAMPLE;
TABLES CAT*PAIRGROUP /NOROW NOCOL NOPERCENT;
RUN;

 

 

PaigeMiller
Diamond | Level 26

@sidello wrote:

Hi; 

 

I have a sample of 100 participants, categorized in three categories (1,2,3). Within each category, pairs are made (based upon several characteristics). 

I now wan't to randomize all participants into two groups, taking into account the pair (one participant of each pair) and the categories (an equal number within each category). 

What is the best way to approach this? 

 

label part="part" cat="cat" pair="pair";
datalines;
1 1 1
2 3 2
3 2 3
4 2 4
5 3 5
6 3 6
7 3 7
8 1 8
9 1 9
10 1 10
11 1 1
12 1 2
13 2 3
14 3 4
15 2 5
16 2 6
17 3 7
18 2 8
19 2 9
20 2 1


Please turn this into actual working DATA step code. You're almost there. Once we have actual working DATA step code, we can produce code that will run on this data. Thanks!

--
Paige Miller
sidello
Calcite | Level 5

Hi, 

 

@PaigeMiller i'm sorry I missed a part. Is this the complete datastep to work with? (I follow the steps suggested while making the post). 

 

data WORK.DATA;
infile datalines dsd truncover;
input part:BEST. cat:BEST. pair:BEST.;
format part BEST. cat BEST. pair BEST.;
label part="part" cat="cat" pair="pair";
datalines;
1 1 1
2 3 2
3 2 3
4 2 4
5 3 5
6 3 6
7 3 7
8 1 8
9 1 9
10 1 10
11 1 1
12 1 2
13 2 3
14 3 4
15 2 5
16 2 6
17 3 7
18 2 8
19 2 9
20 2 10
;;;;

FreelanceReinh
Jade | Level 19

Hi @sidello,

 

Let's assume your input dataset (created without the DSD option of the INFILE statement) was named HAVE. My understanding is that you want to split each pair so as to assign the two participants to two groups (say group 0 and group 1) which are as balanced as possible regarding the frequency distributions of categorical variable CAT. My suggested solution to this problem would look something like this:

/* Rearrange the data to have one observation per pair */

proc sql;
create view _tmp
as select * from have
order by pair, part;
quit;

data pairs(drop=part cat);
call streaminit(27182818);
do until(last.pair);
  set _tmp;
  by pair;
  if first.pair then do;
    _p0=part;
    _c0=cat;
  end;
  else do;
    _p1=part;
    _c1=cat;
  end;
end;
_r=rand('uniform');
run;

proc sql;
drop view _tmp;
quit;

/* Sort dataset PAIRS randomly */

proc sort data=pairs out=pairs;
by _r;
run;

/* Assign participants to two groups (0 and 1) with balanced
   numbers of participants per category */

%let nc=3; /* number of categories (1, 2, ..., &nc) */

data want(drop=_:);
call streaminit(3141592);
array _n[&nc,0:1] _temporary_ (%eval(2*&nc)*0);
/* _n[c, g] is the number of participants of category c in group g. */
set pairs;
/* Check how either of the two possible assignments affects the balance: */
_d0=(_n[_c0,0]+1-_n[_c0,1])**2+(_n[_c1,1]+1-_n[_c1,0])**2;
_d1=(_n[_c0,1]+1-_n[_c0,0])**2+(_n[_c1,0]+1-_n[_c1,1])**2;
part=_p0;
cat=_c0;
/* Prefer an assignment which improves balance (if any), otherwise assign randomly: */
if _d0<_d1 then group=0;
else if _d1<_d0 then group=1;
else group=rand('bern',0.5);
_n[cat,group]+1;
output;
part=_p1;
cat=_c1;
group=1-group;
_n[cat,group]+1;
output;
run;

proc sort data=want;
by part pair;
run;

/* Check if groups are balanced in each category */

proc freq data=want;
tables group*cat;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1740 views
  • 0 likes
  • 4 in conversation