Hi,
Please can you help me with the following programming task. I have an individual-level (1 row per person) dataset that contains a Group ID variable. Depending on the value of the Group ID, I would like to split the group using stratified random sampling (stratification variable: Score_Band). The groups that require splitting and the manner of the split are provided in a 'control' dataset, as follows:
StartGroup | EndGroup | Percent |
1 | 1 | 25 |
1 | 2 | 25 |
1 | 3 | 25 |
1 | 4 | 25 |
5 | 5 | 50 |
5 | 6 | 25 |
5 | 7 | 25 |
In the above example table, the subset of individuals belonging to group 1 are to be split across 4 groups (labelled 1 - 4), each containing 25% of the volume. Individuals belonging to group 5 are to be split into three groups (5 - 7), containing 50%, 25% and 25% of the volume, respectively.
I am trying to write some code that will take an input dataset (1 row per individual) that contains an initial GroupID variable, and maps it to a new GroupID variable, based on what is specified in the 'control' table. Each split is to be based on stratified random sampling (variable Score_Band).
Any help would be greatly appreciated,
Hoa
OR could be done more simpler.
data control;
infile cards expandtabs truncover;
input StartGroup EndGroup Percent;
cards;
1 1 25
1 2 25
1 3 25
1 4 25
5 5 50
5 6 25
5 7 25
;
run;
data control;
set control;
by startgroup;
if first.startgroup then cum=0;
cum+percent;
run;
data key;
set control;
by startgroup;
lag=lag(cum);
if first.startgroup then start=1;
else start=lag+1;
do i=start to cum;
output;
end;
drop lag start cum percent;
run;
/*Input dataset*/
data have;
do group=1,5;
do x=1 to 200;
output;
end;
end;
run;
proc surveyselect data=have out=temp groups=100;
strata group;
run;
proc sort data=temp;
by group groupid;
run;
data want;
merge temp key(rename=(startgroup=group i=groupid));
by group groupid;
run;
proc print;run;
Is it what you looking for.
data have;
infile cards expandtabs truncover;
input StartGroup EndGroup Percent;
cards;
1 1 .25
1 2 .25
1 3 .25
1 4 .25
5 5 .50
5 6 .25
5 7 .25
;
run;
proc surveyselect data=have out=want sampsize=100 outhits method=pps_wr;
strata startgroup;
size percent;
run;
proc print ;run;
I agree with the approach that @Ksharp suggested but, based on your requirements, think you might want to use a different method. I'd suggest:
proc surveyselect data=have out=want sampsize=100 outhits method=pps_sys; strata startgroup; size percent; run;
Art, CEO, AnalystFinder.com
Thank you both for your quick and helpful replies - I really appreciate it!
Apologies if I'm misunderstanding, but I'm wondering how I can apply the sampling logic you've provided to my input dataset. In addition to the 'control' table below...
Group | EndGroup | Percent |
1 | 1 | 0.25 |
1 | 2 | 0.25 |
1 | 3 | 0.25 |
1 | 4 | 0.25 |
5 | 5 | 0.5 |
5 | 6 | 0.25 |
5 | 7 | 0.25 |
I also have an input dataset (please see the attached file). The input dataset contains one row per individual. Based on the control table above, I would like to split the existing groups into sub-groups, using the stratified sampling procedure.
There are 42 individuals belonging to group 1 in the input table. I would like these to be split into 4 groups:
1 - containing roughly 25% of the original group 1
2 - containing roughly 25% of the original group 1
3 - containing roughly 25% of the original group 1
4 - containing roughly 25% of the original group 1
Similarly, there are 234 individuals belonging to group 5 in the input table. I would like these split into 3 groups:
5 - containing roughly 50% of the original group 5
6 - containing roughly 25% of the original group 5
7 - containing roughly 25% of the original group 5
On the input dataset, I have a variable called Score_Band (with possible values 'A', 'B' and 'C'). When I'm splitting the dataset, I like the sampling to be stratified using Score_Band.
I'm aiming to have an output dataset that is the same as the input dataset, but with a new group variable added - based on the new splits.
Thanks again for your help - I really appreciate it!
OK. How about this one ?
data control;
infile cards expandtabs truncover;
input StartGroup EndGroup Percent;
cards;
1 1 25
1 2 25
1 3 25
1 4 25
5 5 50
5 6 25
5 7 25
;
run;
data control;
set control;
by startgroup;
if first.startgroup then cum=0;
cum+percent;
run;
data key;
set control;
by startgroup;
lag=lag(cum);
if first.startgroup then start=1;
else start=lag+1;
do i=start to cum;
output;
end;
drop lag start cum percent;
run;
/*Input dataset*/
data have;
do group=1,5;
do x=1 to 200;
output;
end;
end;
run;
data have;
set have;
call streaminit(123456789);
random=rand('uniform');
run;
proc rank data=have out=temp groups=100 ;
by group;
var random;
ranks r;
run;
proc sort data=temp ;
by group r;
run;
data temp;
set temp;
i=r+1;
drop random r;
run;
data want;
merge temp key(rename=(startgroup=group));
by group i;
run;
proc print;run;
OR could be done more simpler.
data control;
infile cards expandtabs truncover;
input StartGroup EndGroup Percent;
cards;
1 1 25
1 2 25
1 3 25
1 4 25
5 5 50
5 6 25
5 7 25
;
run;
data control;
set control;
by startgroup;
if first.startgroup then cum=0;
cum+percent;
run;
data key;
set control;
by startgroup;
lag=lag(cum);
if first.startgroup then start=1;
else start=lag+1;
do i=start to cum;
output;
end;
drop lag start cum percent;
run;
/*Input dataset*/
data have;
do group=1,5;
do x=1 to 200;
output;
end;
end;
run;
proc surveyselect data=have out=temp groups=100;
strata group;
run;
proc sort data=temp;
by group groupid;
run;
data want;
merge temp key(rename=(startgroup=group i=groupid));
by group groupid;
run;
proc print;run;
Thank you very much for the quick response! I'll give it a try and will report back!
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.