Hi all!
I have a data set with 2million obs and I want to make 7 groups out of it (6 groups with 300,000 obs and one control group with 200,000 obs). It has to be a random selection, however it has to follow this rule:
There is 'country' input in the data and the random groups must have the same country percentages as the big data itself, i.e. if United States is 40% of the data, then all of these subgroups must have 40% United States in the country column.
Any idea is highly appreciated!
Thanks 🙂
Hi and welcome to the SAS Communities 🙂
Here is one way
/* Example Data */
data have;
array c{3} $20 _temporary_ ("United States", "Canada", "Italy");
do ID=1 to 2e6;
Country=c[ceil(rand("Uniform")*3)];
var1=rand('integer', 1, 100);
var2=rand('integer', 1, 100);
output;
end;
run;
proc sort data=have;
by Country;
run;
proc surveyselect data=have seed=125 out=one n=200000 outall noprint;
strata Country / alloc=prop;
run;
data control;
set one;
where Selected=1;
GroupID=0;
keep ID Country var1 var2 GroupID;
run;
proc surveyselect data=one(where=(Selected=0)) seed=123 out=two noprint
groups=(3e5, 3e5, 3e5, 3e5, 3e5, 3e5);
run;
data want;
set control two(keep=ID Country var1 var2 GroupID);
run;
proc sort data=want;
by GroupID;
run;
Maybe rand("table")?
data have;
array ccc[0:5] $ 20 _TEMPORARY_ ("Poland", "Australia", "UK", "Japan", "Mexico", "Canada");
do I = 1 to 2e6;
country = ccc[mod(I, 6)];
output;
end;
run;
proc sort
data = have
out = tmp
;
by country;
run;
data want;
set tmp;
call streaminit(123);
GRP = RAND('TABLE', 3/20,3/20,3/20,3/20,3/20,3/20,2/20);
run;
proc freq data = want;
table country * GRP;
run;
all the best
Bart
Hi and welcome to the SAS Communities 🙂
Here is one way
/* Example Data */
data have;
array c{3} $20 _temporary_ ("United States", "Canada", "Italy");
do ID=1 to 2e6;
Country=c[ceil(rand("Uniform")*3)];
var1=rand('integer', 1, 100);
var2=rand('integer', 1, 100);
output;
end;
run;
proc sort data=have;
by Country;
run;
proc surveyselect data=have seed=125 out=one n=200000 outall noprint;
strata Country / alloc=prop;
run;
data control;
set one;
where Selected=1;
GroupID=0;
keep ID Country var1 var2 GroupID;
run;
proc surveyselect data=one(where=(Selected=0)) seed=123 out=two noprint
groups=(3e5, 3e5, 3e5, 3e5, 3e5, 3e5);
run;
data want;
set control two(keep=ID Country var1 var2 GroupID);
run;
proc sort data=want;
by GroupID;
run;
this worked perfectly, thanks a lot!
Anytime, glad to help 🙂
I would start with something like:
proc surveyselect data=have out=want groups=(300000, 300000, 300000, 300000, 300000, 300000, 200000); run;
1. Use PROC SURVEYSELECT
2. Since you want to match by proportion of country, make your strata country.
The documentation has a fully worked example of proportional allocation here:
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Check out this tutorial series to learn how to build your own steps in SAS Studio.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.