Hi all!
I have a data set with 2million obs and I want to make 7 groups out of it (6 groups with 300,000 obs and one control group with 200,000 obs). It has to be a random selection, however it has to follow this rule:
There is 'country' input in the data and the random groups must have the same country percentages as the big data itself, i.e. if United States is 40% of the data, then all of these subgroups must have 40% United States in the country column.
Any idea is highly appreciated!
Thanks 🙂
Hi and welcome to the SAS Communities 🙂
Here is one way
/* Example Data */
data have;
array c{3} $20 _temporary_ ("United States", "Canada", "Italy");
do ID=1 to 2e6;
Country=c[ceil(rand("Uniform")*3)];
var1=rand('integer', 1, 100);
var2=rand('integer', 1, 100);
output;
end;
run;
proc sort data=have;
by Country;
run;
proc surveyselect data=have seed=125 out=one n=200000 outall noprint;
strata Country / alloc=prop;
run;
data control;
set one;
where Selected=1;
GroupID=0;
keep ID Country var1 var2 GroupID;
run;
proc surveyselect data=one(where=(Selected=0)) seed=123 out=two noprint
groups=(3e5, 3e5, 3e5, 3e5, 3e5, 3e5);
run;
data want;
set control two(keep=ID Country var1 var2 GroupID);
run;
proc sort data=want;
by GroupID;
run;
Maybe rand("table")?
data have;
array ccc[0:5] $ 20 _TEMPORARY_ ("Poland", "Australia", "UK", "Japan", "Mexico", "Canada");
do I = 1 to 2e6;
country = ccc[mod(I, 6)];
output;
end;
run;
proc sort
data = have
out = tmp
;
by country;
run;
data want;
set tmp;
call streaminit(123);
GRP = RAND('TABLE', 3/20,3/20,3/20,3/20,3/20,3/20,2/20);
run;
proc freq data = want;
table country * GRP;
run;
all the best
Bart
Hi and welcome to the SAS Communities 🙂
Here is one way
/* Example Data */
data have;
array c{3} $20 _temporary_ ("United States", "Canada", "Italy");
do ID=1 to 2e6;
Country=c[ceil(rand("Uniform")*3)];
var1=rand('integer', 1, 100);
var2=rand('integer', 1, 100);
output;
end;
run;
proc sort data=have;
by Country;
run;
proc surveyselect data=have seed=125 out=one n=200000 outall noprint;
strata Country / alloc=prop;
run;
data control;
set one;
where Selected=1;
GroupID=0;
keep ID Country var1 var2 GroupID;
run;
proc surveyselect data=one(where=(Selected=0)) seed=123 out=two noprint
groups=(3e5, 3e5, 3e5, 3e5, 3e5, 3e5);
run;
data want;
set control two(keep=ID Country var1 var2 GroupID);
run;
proc sort data=want;
by GroupID;
run;
this worked perfectly, thanks a lot!
Anytime, glad to help 🙂
I would start with something like:
proc surveyselect data=have out=want groups=(300000, 300000, 300000, 300000, 300000, 300000, 200000); run;
1. Use PROC SURVEYSELECT
2. Since you want to match by proportion of country, make your strata country.
The documentation has a fully worked example of proportional allocation here:
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Check out this tutorial series to learn how to build your own steps in SAS Studio.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.