You are a perinatal and reproductive epidemiologist and are given an Excel spreadsheet, Project3.xlsx, which contains 1000 records. There are a total of 5 hospitals in the dataset and in each of these hospitals, there were three possible types of infant outcomes: normal (>= 2500 grams), low birth weight (<2500 grams but >= 1500 grams), and very low birth weight (< 1500 grams).
You are interested in conducting a sub-study in which you would like to randomly select 3 very low birth weight, 10 low birth weight, and 25 normal infants from each hospital.
Will I get course credit for doing this homework?
OK, here's the randomizing part, untested. It assumes dataset HAVE is sorted by hospital. I know that RANUNI is deprecated but you can substitute SAS's improved uniform random number generator.
data want (drop=av_: w_: w ninfants);
array avail {3} av_vlbw av_lbw av_nw;
do w=1 to 3; avail{w}=0; end;
array want {3} w_vlbw w_lbw w_nw;
w_vlbw=3;
w_lbw=10;
w_nw=25;
/* Get total available infants by weight group */
do ninfants=1 by 1 until (last.hospital);
set have;
by hospital;
w=1+(weight>=1500)+(weight>=2500);
avail{w}=avail{w}+1;
end;
/* re-read and randomly draw */
do until (last.hospital);
set have;
by hospital;
w=1+(weight>=1500)+(weight>=2500);
if want{w}/avail{w}>=ranuni(09184065) then do;
output;
want{w}=want{w}-1;
end;
avail{w}=avail{w}-1;
end;
run;
Thanks.
IMO you should really look into PROC SURVEYSELECT.
proc import datafile='/folders/myfolders/project3.xlsx' out=have dbms=xlsx replace; run; proc sort data=have;by hospnm bthwt;run; proc surveyselect data=have sampsize=25 selectall out=temp; strata hospnm bthwt; run; data want; do until(last.bthwt); set temp; by hospnm bthwt; end; _bthwt=bthwt; do i=1 by 1 until(last.bthwt); set temp; by hospnm bthwt; if _bthwt='VLBW' then do; if i le 3 then output; end; else if _bthwt='LBW' then do; if i le 10 then output; end; else output; end; keep hospnm bthwt subjno; run; proc print noobs;run;
proc import datafile='/folders/myfolders/project3.xlsx' out=have dbms=xlsx replace; run; proc sort data=have;by hospnm bthwt;run; proc surveyselect data=have sampsize=25 selectall out=temp; strata hospnm bthwt; run; data want; set temp; by hospnm bthwt; if first.bthwt then i=0; i+1; if bthwt='VLBW' then do; if i le 3 then output; end; else if bthwt='LBW' then do; if i le 10 then output; end; else output; keep hospnm bthwt subjno; run; proc print noobs;run;
I have a concern about your proposed solution:
You're asking SURVEYSELECT for a sample size of 25 for each of 3 birthweight group, so that later you can take the smaller sample sizes of 3 and 10 from 2 of those groups. Let's say the hospital has 5 VLBW births (or any number between 4 and 25), in which case surveyselect would keep all the VLBW births, presumably in the original order. Then you take the first 3.
Frequently the original data might be sorted within hospital, say by date of birth. Then your suggestion risks never drawing VLBW births at the end of the time period. I think you either have to pre-sort the data in random order within hospital (in which case surveyselect is not needed at all), or run surveyselect 3 times, with 3 different WHERE statements and 3 sampsize values.
regards,
Mark
Mike, You are right. I never thought proc surveyselect would not change the order of obs when sample size is less than 25. Bad feature for proc surveyselect . This could be simple as yours : proc import datafile='/folders/myfolders/project3.xlsx' out=have dbms=xlsx replace; run; data have; set have; _rand=ranuni(0); run; proc sort data=have;by hospnm bthwt _rand;run; data want; set have; by hospnm bthwt; if first.bthwt then i=0; i+1; if bthwt='VLBW' then do; if i le 3 then output; end; else if bthwt='LBW' then do; if i le 10 then output; end; else output; keep hospnm bthwt subjno; run; proc print noobs;run;
Opps, This ought to be right. proc import datafile='/folders/myfolders/project3.xlsx' out=have dbms=xlsx replace; run; data have; set have; _rand=ranuni(0); run; proc sort data=have;by hospnm bthwt _rand;run; data want; set have; by hospnm bthwt; if first.bthwt then i=0; i+1; if bthwt='VLBW' then do; if i le 3 then output; end; else if bthwt='LBW' then do; if i le 10 then output; end; else do; if i le 25 then output; end; keep hospnm bthwt subjno; run; proc print noobs;run;
Nice program. I like the idea of sorting by hospnm bthwt. But I think further simplification is achieveable:
data have (drop=imax i rand);
retain imax;
set have;
by hospnm bthwt;
if first.bthwt then do;
if bthwt='VLBW' then imax=3; else
if bthwt='LBW' then imax=10; else
if bthwt='Normal' then imax=25;
i=0;
end;
i+1;
if i<=imax;
run;
My comment to @Ksharp prompted a thought of a simpler approach. It has more steps, but doesn't require counting the number of available births, and preserves equal sampling probability for each record within a hospital/birthweight group.
Now editted to use orginal var names and values:
And edditted again - in my first edit I used strikeout font for variable hospital, which I see is not preserved in the final version. Those are gone now.
data need / view=need;
set have;
rnum=ranuni(01672356);
run;
proc sort data=need out=sorted;
by hospm rnum;
run;
data want (drop=rnum w);
set sorted;
by hospnm ;
array need {3} _temporary_ ;
if first.hospnm then do;
need{1}=3; need{2}=10; need{3}=25;
end;
w=findw('VLBW LBW Normal',trim(bthwt),' ','e');
if need{w}>0;
need{w}=need{w}-1;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.