Hi,
I have a case data set with 70 hospitals and 5 disease types. Some hospitals have fewer than 5 cases for the year, while others have many. Some hospitals have all 5 disease types; others do not.
I am trying to write code to randomly sample 5 cases per hospital, subject to some rules about how the cases are chosen. If a hospital has all 5 disease types, I want to randomly select one case of each disease type. If a hospital has more than 5 cases but only 4 disease types, I want one case of each of the 4 types plus a fifth case of any disease type. If a hospital has only 3 cases, I want all 3 (random sampling doesn't really matter there). And so on. The sample is only for concordance purposes, so I don't truly need a randomized sample. However, some hospitals have more than 100 cases, with more than 15 of each disease type, and I'd like those selections to be random (not just the first cases SAS reads).
Ultimately, the number of cases per hospital determines whether I can sample 5 at all (if there are fewer, I'll take them all). Within that limit, I want at least one case of each disease type present at the hospital, for a total of 5 cases per hospital.
With 70 hospitals and 5 cases sampled from each, I'd have a final sample size of 350 if every hospital had at least 5 cases. That won't always be the situation, because some hospitals have fewer than 5 cases.
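To make that arithmetic concrete, here is a minimal sketch that computes the achievable sample size as the sum over hospitals of min(cases, 5). It assumes the data are stored one row per case in a data set named HAVE with a HOSP_ID variable; the data set name is an assumption for illustration, not something stated in the post.

```sas
/* Assumption: HAVE holds one row per case and has a HOSP_ID variable */
proc sql;
  /* Each hospital contributes 5 cases, or all of its cases if it has fewer */
  select sum(min(n_cases, 5)) as expected_sample_size
  from (select hosp_id, count(*) as n_cases
        from have
        group by hosp_id);
quit;
```

If every one of the 70 hospitals had at least 5 cases, this would return 350; otherwise it returns something smaller.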
Example below:
hosp_id disease_type
A 1
A 1
A 2
A 5
B 1
B 2
B 3
B 3
B 4
B 5
...
C 1
C 1
C 1
C 2
C 2
C 2
C 4
C 5
...
Any help greatly appreciated! 🙂
Hi @novicenoice,
I'm not sure if PROC SURVEYSELECT would be particularly useful for your requirements, so here's a suggestion using the traditional technique of sorting by random numbers, balancing the disease types within each hospital (as far as possible). For example, if a hospital had ten patients -- two with disease type 1 and the rest with disease type 4 -- the two type 1 patients would be selected with certainty, plus a random sample of three out of the eight type 4 patients. If three (rather than two) of the ten patients had disease type 1, the third patient of each type would compete for the fifth slot, so it would be decided randomly (with probability 1/2) whether all three or only a random sample of two type 1 patients were included in the final sample of five.
/* Create sample data for demonstration */
data have;
  call streaminit(27182818);
  do hosp_id=1 to 70;
    do _n_=1 to ceil(1/rand('expo',.2));
      disease_type=rand('table',.2,.15,.25,.3,.1);
      usubjid+1;
      output;
    end;
  end;
run;
proc sort data=have;
  by hosp_id disease_type;
run;
/* Create a random sort order within the (HOSP_ID, DISEASE_TYPE) strata */
data temp;
  call streaminit('MT64',27182818);
  set have;
  _r1=rand('uniform');
run;
proc sort data=temp;
  by hosp_id disease_type _r1;
run;
/* Number the observations sequentially within the strata */
/* in order to define selection priorities */
data temp;
  call streaminit('MT64',3141592);
  set temp(drop=_r1);
  by hosp_id disease_type;
  if first.disease_type then _prio=1;
  else _prio+1;
  _r2=rand('uniform');
run;
/* Create a random sort order within the (HOSP_ID, _PRIO) groups */
proc sort data=temp;
  by hosp_id _prio _r2;
run;
/* Select five observations per HOSP_ID if possible (otherwise all) */
data want(drop=_:);
  set temp;
  by hosp_id;
  if first.hosp_id then _c=1;
  else _c+1;
  if _c<=5;
run;
proc sort data=want;
  by hosp_id disease_type usubjid;
run;
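A possible sanity check on the result, sketched under the assumption that HAVE and WANT are the data sets created by the code above: each hospital should contribute min(cases, 5) observations, and every disease type present at a hospital should appear in its sample. The data set CHECK and its variable names are introduced here for illustration only.

```sas
/* Compare per-hospital counts in HAVE (all cases) and WANT (the sample) */
proc sql;
  create table check as
  select h.hosp_id,
         h.n_cases,
         h.n_types,
         w.n_sel,
         w.n_sel_types,
         /* sample size should be 5, or all cases if fewer than 5 exist */
         (w.n_sel = min(h.n_cases, 5)) as size_ok,
         /* every disease type present at the hospital should be sampled */
         (w.n_sel_types = h.n_types) as types_ok
  from (select hosp_id,
               count(*) as n_cases,
               count(distinct disease_type) as n_types
        from have
        group by hosp_id) as h
       left join
       (select hosp_id,
               count(*) as n_sel,
               count(distinct disease_type) as n_sel_types
        from want
        group by hosp_id) as w
  on h.hosp_id = w.hosp_id;
quit;

/* List any hospitals that violate either rule */
proc print data=check;
  where not (size_ok and types_ok);
run;
```

If the selection behaves as described, the PROC PRINT step should find no observations to list.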
Random sampling with at least one case per disease_type (where available) and a total of 5 per hosp_id:
data have;
input hosp_id $ disease_type;
datalines;
A 1
A 1
A 2
A 5
B 1
B 2
B 3
B 3
B 4
B 5
C 1
C 1
C 1
C 2
C 2
C 2
C 4
C 5
;
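/* Attach a random number to every case */
data haveRnd;
  call streaminit(85865);
  set have;
  rnd = rand("uniform");
run;
/* Shuffle the cases within each (HOSP_ID, DISEASE_TYPE) group */
proc sort data=haveRnd; by hosp_id disease_type rnd; run;
/* Flag one randomly chosen case per disease type within each hospital */
data haveFirst;
  set haveRnd; by hosp_id disease_type;
  first = first.disease_type;
run;
/* Put the flagged cases first, then the remaining cases in random order */
proc sort data=haveFirst; by hosp_id descending first rnd; run;
/* Keep the first five cases per hospital (or all, if fewer than five) */
data want;
  do order = 1 by 1 until (last.hosp_id);
    set haveFirst; by hosp_id;
    if order <= 5 then output;
  end;
  drop rnd order first;
run;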
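To see at a glance what was selected, a cross-tabulation of WANT is one option (a sketch, assuming WANT was created by the steps above). With the sample data, hospital A should contribute all 4 of its cases, hospital B one case of each of the 5 disease types, and hospital C 5 cases covering types 1, 2, 4, and 5; which of C's remaining cases fills the fifth slot depends on the random numbers.

```sas
/* Count the selected cases per hospital and disease type */
proc freq data=want;
  tables hosp_id*disease_type / norow nocol nopercent;
run;
```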