novicenoice
Calcite | Level 5

Hi,

 

I have a case data set with 70 hospitals and 5 disease types. Some hospitals have fewer than 5 cases for the year, while others have many. Some hospitals have all 5 disease types, while others have only some of them.

 

I am trying to write code to randomly sample 5 cases per hospital, subject to some rules about how the cases are chosen. If a hospital has all 5 disease types, I want to randomly select one case of each disease type. If there are more than 5 cases and only 4 disease types, I want to capture one case of each of the 4 types, and the fifth case can be of any disease type. If there are only 3 cases, I want to capture all 3 (random sampling doesn't really matter here anymore). And so on. The random sampling is only for concordance purposes, so I don't truly need a randomized sample. However, there are hospitals with more than 100 cases, with more than 15 of each disease type, and I'd like those selections to be random (not just the first case SAS reads).

 

Ultimately, the number of cases per hospital determines whether I can even sample 5 (if there are fewer, I'll take all of them). Within that limit, I want to select at least one case of each disease type per hospital and end up with 5 cases from each hospital.

 

With 70 hospitals and 5 cases sampled from each, I'd have a final sample size of 350 if every hospital had at least 5 cases. However, that is not always the situation, because some hospitals may have fewer than 5 cases.
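
The attainable sample size is the sum over hospitals of min(5, number of cases). As a minimal sketch, assuming the case data set is called have (a name not given in the question) with one row per case and a hosp_id column, it could be computed like this:

```sas
/* Sketch only: HAVE is an assumed data set name, one row per case. */
/* With two arguments, MIN is the row-wise SAS function;            */
/* SUM with one argument is the PROC SQL aggregate.                 */
proc sql;
  select sum(min(n_cases, 5)) as expected_sample_size
  from (select hosp_id, count(*) as n_cases
        from have
        group by hosp_id);
quit;
```

The result can be at most 350 with 70 hospitals, and it is smaller whenever some hospital has fewer than 5 cases.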

 

Example below:

 

hosp_id     disease_type
A           1
A           1
A           2
A           5
B           1
B           2
B           3
B           3
B           4
B           5
...
C           1
C           1
C           1
C           2
C           2
C           2
C           4
C           5
...

 

Any help greatly appreciated! 🙂

Accepted Solution
FreelanceReinh
Jade | Level 19

Hi @novicenoice,

 

I'm not sure if PROC SURVEYSELECT would be particularly useful for your requirements, so here's a suggestion using the traditional technique of sorting by random numbers, balancing the disease types within each hospital (as far as possible). For example, if a hospital had ten patients -- two with disease type 1 and the rest with disease type 4 -- the first two would be selected with certainty, plus a random sample of three out of the remaining eight. If three (rather than two) of the ten patients had disease type 1, it would be decided randomly (with probability 1/2) whether all three or only a random sample of two were included in the final sample of five.

 

/* Create sample data for demonstration */

data have;
call streaminit(27182818);
do hosp_id=1 to 70;
  do _n_=1 to ceil(1/rand('expo',.2));
    disease_type=rand('table',.2,.15,.25,.3,.1);
    usubjid+1;
    output;
  end;
end;
run;

proc sort data=have;
by hosp_id disease_type;
run;

/* Create a random sort order within the (HOSP_ID, DISEASE_TYPE) strata */

data temp;
call streaminit('MT64',27182818);
set have;
_r1=rand('uniform');
run;

proc sort data=temp;
by hosp_id disease_type _r1;
run;

/* Number the observations sequentially within the strata */
/* in order to define selection priorities                */

data temp;
call streaminit('MT64',3141592);
set temp(drop=_r1);
by hosp_id disease_type;
if first.disease_type then _prio=1;
else _prio+1;
_r2=rand('uniform');
run;

/* Create a random sort order within the (HOSP_ID, _PRIO) groups */

proc sort data=temp;
by hosp_id _prio _r2;
run;

/* Select five observations per HOSP_ID if possible (otherwise all) */

data want(drop=_:);
set temp;
by hosp_id;
if first.hosp_id then _c=1;
else _c+1;
if _c<=5;
run;

proc sort data=want;
by hosp_id disease_type usubjid;
run;
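
To check that the result meets the requirements, one option is to compare per-hospital counts in have and want. The following is only a sketch: the table name check_fail and the column names n_cases, n_types, n_sel and n_types_sel are made up for illustration, and it assumes at most five distinct disease types, as in the question.

```sas
/* Sketch: list hospitals where WANT does not contain               */
/*   min(5, number of cases) rows, or does not cover                */
/*   min(5, number of available disease types) distinct types.      */
/* All names introduced here are hypothetical.                      */
proc sql;
  create table check_fail as
  select h.hosp_id, h.n_cases, h.n_types, w.n_sel, w.n_types_sel
  from (select hosp_id,
               count(*) as n_cases,
               count(distinct disease_type) as n_types
        from have
        group by hosp_id) as h
       inner join
       (select hosp_id,
               count(*) as n_sel,
               count(distinct disease_type) as n_types_sel
        from want
        group by hosp_id) as w
       on h.hosp_id = w.hosp_id
  where w.n_sel ne min(h.n_cases, 5)
     or w.n_types_sel ne min(h.n_types, 5);
quit;
```

If the selection logic works as intended, check_fail should contain no rows.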

Reeza
Super User
With that many custom rules I think you have to go through and implement a manual selection. I also wouldn't call it random 🙂
novicenoice
Calcite | Level 5
I definitely can see manual selection happening with hospitals with smaller case counts.
The last thing I tried was to split the dataset, find the hospitals with either 5 or fewer cases or fewer than 5 disease types, and select those cases manually.
Then I would randomly select from the hospitals with more cases and all 5 disease types. Without random selection (with PROC SURVEYSELECT), SAS selects the first case it sees.
Reeza
Super User
Actually, the SELECTALL option in PROC SURVEYSELECT handles that scenario fine.

This is the part that's complicated:
I am trying to write code to randomly sample 5 cases per hospital, subject to some rules about how the cases are chosen. If a hospital has all 5 disease types, I want to randomly select one case of each disease type. If there are more than 5 cases and only 4 disease types, I want to capture one case of each of the 4 types, and the fifth case can be of any disease type. If there are only 3 cases, I want to capture all 3 (random sampling doesn't really matter here anymore).
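
For reference, here is a minimal sketch of the SELECTALL part only. It draws up to 5 cases per hospital and takes every case when a hospital has fewer than 5, but it does not balance the disease types. The data set names have_sorted and sample_simple and the seed value are placeholders, not something from this thread.

```sas
/* Sketch: simple random sample of 5 cases per hospital.            */
/* HAVE_SORTED, SAMPLE_SIMPLE and the seed are placeholder choices. */
proc sort data=have out=have_sorted;
  by hosp_id;                      /* STRATA requires sorted input  */
run;

proc surveyselect data=have_sorted out=sample_simple
                  method=srs       /* simple random sampling        */
                  n=5              /* 5 cases per stratum           */
                  selectall        /* take all if stratum has < 5   */
                  seed=12345;
  strata hosp_id;
run;
```

The rule about covering each disease type would still need separate logic on top of this.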
novicenoice
Calcite | Level 5
This also works. Thank you!!
novicenoice
Calcite | Level 5
Thank you so much!! This works!
PGStats
Opal | Level 21

Random sampling with at least one case per disease_type (where available) and a total of 5 per hosp_id:

 

data have;
input hosp_id $ disease_type; 
datalines;
A                1         
A                1 
A                2 
A                5 
B                1
B                2
B                3
B                3
B                4
B                5
C                1         
C                1
C                1
C                2
C                2
C                2
C                4
C                5
;

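/* Attach a random number to every case (fixed seed for reproducibility) */
data haveRnd;
call streaminit(85865);
set have;
rnd = rand("uniform");
run;

/* Random order within each (hosp_id, disease_type) group */
proc sort data=haveRnd; by hosp_id disease_type rnd; run;

/* Flag one randomly chosen case per disease type */
data haveFirst;
set haveRnd; by hosp_id disease_type;
first = first.disease_type;
run;

/* Flagged cases first, then the remaining cases in random order */
proc sort data=haveFirst; by hosp_id descending first rnd; run;

/* Keep the first five cases per hospital (all cases if fewer than five) */
data want;
do order = 1 by 1 until (last.hosp_id);
    set haveFirst; by hosp_id;
    if order <= 5 then output;
    end;
drop rnd order first;
run;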


 

PG
