Solved: Re: Random sampling with parameters

novicenoice · Posted 08-19-2021 02:16 PM

Hi,

I have a case data set with 70 hospitals and 5 disease types. Some hospitals have fewer than 5 cases for the year, some have many. Some hospitals have all 5 disease types, some do not.

I am trying to write code to randomly sample 5 cases per hospital. However, there are constraints on how the cases can be chosen. If a hospital has all 5 disease types, I want to randomly select one case per disease type. If there are more than 5 cases and only 4 disease types, I want to capture one case of each of the 4 types, and a fifth case must also be chosen, but its disease type doesn't matter. If there are only 3 cases, I want to capture all 3 (random sampling doesn't really matter here anymore). And so on. The random sampling is only for concordance purposes, so I don't truly need a randomized sample. However, there are hospitals with more than 100 cases and more than 15 of each disease type, and I'd like those selections to be random (not just the first case SAS reads).

In short: the case count per hospital determines whether I can even sample 5 (if fewer, I'll take all). Within that limit, I want at least one case of each disease type per hospital, for a total of 5 cases from each hospital.

With 70 hospitals, if every hospital had at least 5 cases, I'd have a final sample size of 350. However, that is not always the situation, because some hospitals have fewer than 5 cases.

Example below:

hosp_id disease_type
A 1
A 2
A 5
B 1
B 2
B 3
B 4
B 5
...
C 1
C 2
C 4
C 5
...

Any help greatly appreciated! 🙂

FreelanceReinh · Posted 08-20-2021 02:32 PM

Hi @novicenoice,

I'm not sure if PROC SURVEYSELECT would be particularly useful for your requirements, so here's a suggestion using the traditional technique of sorting by random numbers, balancing the disease types within each hospital (as far as possible). For example, if a hospital had ten patients -- two with disease type 1 and the rest with disease type 4 -- the first two would be selected with certainty, plus a random sample of three out of the remaining eight. If three (rather than two) of the ten patients had disease type 1, it would be decided randomly (with probability 1/2) whether all three or only a random sample of two were included in the final sample of five.

/* Create sample data for demonstration */

data have;
call streaminit(27182818);
do hosp_id=1 to 70;
  do _n_=1 to ceil(1/rand('expo',.2));
    disease_type=rand('table',.2,.15,.25,.3,.1);
    usubjid+1;
    output;
  end;
end;
run;

proc sort data=have;
by hosp_id disease_type;
run;

/* Create a random sort order within the (HOSP_ID, DISEASE_TYPE) strata */

data temp;
call streaminit('MT64',27182818);
set have;
_r1=rand('uniform');
run;

proc sort data=temp;
by hosp_id disease_type _r1;
run;

/* Number the observations sequentially within the strata */
/* in order to define selection priorities                */

data temp;
call streaminit('MT64',3141592);
set temp(drop=_r1);
by hosp_id disease_type;
if first.disease_type then _prio=1;
else _prio+1;
_r2=rand('uniform');
run;

/* Create a random sort order within the (HOSP_ID, _PRIO) groups */

proc sort data=temp;
by hosp_id _prio _r2;
run;

/* Select five observations per HOSP_ID if possible (otherwise all) */

data want(drop=_:);
set temp;
by hosp_id;
if first.hosp_id then _c=1;
else _c+1;
if _c<=5;
run;

proc sort data=want;
by hosp_id disease_type usubjid;
run;
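
As a quick sanity check, something like the following could be run afterwards. This is only a sketch: it assumes the HAVE and WANT datasets created above, and the dataset name CHECK and its column names are made up for illustration. It lists any hospital whose sample has the wrong size or covers fewer disease types than it could, so the result should be empty if the selection worked as intended.

```sas
/* Sketch only: CHECK and its column names are hypothetical */
proc sql;
create table check as
select h.hosp_id, h.n_cases, h.n_types, w.n_sampled, w.n_types_sampled
from (select hosp_id,
             count(*) as n_cases,                 /* cases available       */
             count(distinct disease_type) as n_types
      from have
      group by hosp_id) as h
     left join
     (select hosp_id,
             count(*) as n_sampled,               /* cases selected        */
             count(distinct disease_type) as n_types_sampled
      from want
      group by hosp_id) as w
     on h.hosp_id=w.hosp_id
/* Expect min(n_cases,5) sampled cases and every available type covered */
where w.n_sampled ne min(h.n_cases,5)
   or w.n_types_sampled ne min(h.n_types,5);
quit;
```

A hospital that is missing from WANT altogether would also show up in CHECK, because its sampled counts are missing after the left join.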


Reeza · Posted 08-19-2021 02:49 PM

With that many custom rules I think you have to go through and implement a manual selection. I also wouldn't call it random 🙂

novicenoice · Posted 08-19-2021 02:55 PM

I definitely can see manual selection happening with hospitals with smaller case counts.
The last thing I tried was to split the dataset, find the hospitals with either 5 or fewer cases or fewer than 5 disease types, and select those cases manually.
Then I would randomly select from the hospitals with more cases and all 5 disease types. Without random selection (e.g., with PROC SURVEYSELECT), SAS just selects the first case it reads.

Reeza · Posted 08-19-2021 03:28 PM

Actually, the SELECTALL option in PROC SURVEYSELECT handles that scenario fine.
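
For that simpler requirement (five random cases per hospital, or all cases when a hospital has fewer than five), a minimal sketch might look like this. It assumes an input dataset named HAVE with a HOSP_ID variable; the output dataset names and the seed value are arbitrary choices for illustration. It does not try to balance the disease types.

```sas
/* STRATA processing expects the input sorted by the strata variable */
proc sort data=have out=have_sorted;   /* HAVE_SORTED is a made-up name */
by hosp_id;
run;

/* Simple random sample of 5 per hospital; SELECTALL takes every case */
/* in a hospital that has fewer than 5 instead of raising an error    */
proc surveyselect data=have_sorted out=sample_srs
                  method=srs sampsize=5 selectall
                  seed=12345;          /* arbitrary seed for repeatability */
strata hosp_id;
run;
```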

This is the part that's complicated:
I am trying to write code to randomly sample 5 cases per hospital. However, there are constraints on how the cases can be chosen. If a hospital has all 5 disease types, I want to randomly select one case per disease type. If there are more than 5 cases and only 4 disease types, I want to capture one case of each of the 4 types, and a fifth case must also be chosen, but its disease type doesn't matter. If there are only 3 cases, I want to capture all 3 (random sampling doesn't really matter here anymore).

novicenoice · Posted 08-23-2021 04:02 PM

This also works. Thank you!!

novicenoice · Posted 08-23-2021 04:01 PM

Thank you so much!! This works!

PGStats · Posted 08-20-2021 03:03 PM

Random sampling with at least one case per disease_type and a total of 5 per hosp_id:

data have;
input hosp_id $ disease_type; 
datalines;
A                1         
A                1 
A                2 
A                5 
B                1
B                2
B                3
B                3
B                4
B                5
C                1         
C                1
C                1
C                2
C                2
C                2
C                4
C                5
;

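/* Attach a random number to every case */
data haveRnd;
call streaminit(85865);
set have;
rnd = rand("uniform");
run;

/* Random order within each (hosp_id, disease_type) group */
proc sort data=haveRnd; by hosp_id disease_type rnd; run;

/* Flag one randomly chosen case per disease type (first=1) */
data haveFirst;
set haveRnd; by hosp_id disease_type;
first = first.disease_type;
run;

/* Flagged cases first, then the remaining cases in random order */
proc sort data=haveFirst; by hosp_id descending first rnd; run;

/* Keep the first five cases per hospital (all if fewer than five) */
data want;
do order = 1 by 1 until (last.hosp_id);
    set haveFirst; by hosp_id;
    if order <= 5 then output;
end;
drop rnd order first;
run;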
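
To eyeball the result, a cross-tabulation of the selected cases may help. This is just a sketch that assumes the WANT dataset created above.

```sas
/* Count the selected cases per hospital and disease type */
proc freq data=want;
tables hosp_id*disease_type / norow nocol nopercent;
run;
```

With the example data above, hospital A has only four cases, so all four are kept. Hospital B gets one case of each of its five disease types. Hospital C gets one case of each of its four types plus one more case drawn at random from the rest.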

PG
