I have a large health plan dataset (n~36 million records) that I need to randomly select two non-decedent controls for each decedent case. The matching variable is decile of air pollutant exposure, which was calculated from the full distribution of daily pollutant concentrations for all zip codes occurring in the dataset, across the 2-year study period. I've tried to approach this with 3 files:
File 1 - Decedent cases (n=2.8 million): ID, Date (DeathDt), Decile (on DeathDt)
File 2 - Bridge file: Date, Zipcode, Decile [file contains all possible combinations of date and zipcode in the non-decedent file, with decile assigned based on pollutant concentration on that date for that zipcode]
File 3 - Non-decedent controls (n=34.6 million): ID, Zipcode
The files look like this:
File 1:
ID | Date | Decile |
1 | 1/1/2017 | 1 |
22 | 1/1/2017 | 1 |
41 | 1/1/2017 | 1 |
56 | 1/1/2017 | 2 |
79 | 1/1/2017 | 2 |
85 | 1/1/2017 | 2 |
100 | 1/2/2017 | 1 |
118 | 1/2/2017 | 1 |
125 | 1/2/2017 | 2 |
167 | 1/2/2017 | 2 |
178 | 1/2/2017 | 3 |
File 2:
Date | Zipcode | Decile |
1/1/2017 | 12832 | 1 |
1/1/2017 | 03349 | 1 |
1/1/2017 | 04001 | 2 |
1/2/2017 | 56723 | 2 |
1/2/2017 | 88123 | 1 |
1/3/2017 | 80010 | 3 |
1/3/2017 | 96224 | 3 |
File 3:
ID | Zipcode |
2 | 88123 |
3 | 12345 |
4 | 03304 |
5 | 03867 |
6 | 04945 |
7 | 04001 |
8 | 98765 |
9 | 98801 |
10 | 96224 |
11 | 00001 |
12 | 83356 |
Again, for each decedent case, I need to randomly select 2 non-decedent controls who were alive as of the case's death date, but this can't be done directly because there is no date in the non-decedent file. I created a cartesian product joining Files 1 and 2, which produced a file named 'combined', and gave me the full range of zipcodes that are in the same decile as the case, on the case's date of death (code below). My plan was to join this file with the non-decedents file by zipcode, and then select 2 controls randomly. But at this point, the files were huge, and I ran out of space...and I don't know really how to do the random selection from here. I'm thinking there has to be a more efficient way to do this. Sorry for the complicated description. Would appreciate any advice for how to do this..
proc sql;
create table combined as
select f1.decile as CaseDecile, f1.date as CaseDate
f2.decile as BridgeDecile, f2.date as BridgeDate, f2.zipcode as BridgeZipcode
from decedents as f1, bridge as f2
where f1.decile = f2.decile AND f1.date = f2.date;
quit;
@JKHess
Below my attempt to meet your requirements based on my understanding. I believe the code does what it should but I strongly suggest you take some time to create some really good sample data for testing where you know exactly how the desired result needs to look like.
I've chosen for performance reasons a data step/hash table approach. Your data volumes are serious so depending on the memory you've got things might or might not work.
I'm sharing two coding options (variations). Option 1 requires a bit less data prep but in turn might require up to 7GB of memory, option 2 should already work with 4GB of memory.
You can check your memory settings via below (results are in bytes, zero means normally as much as available).
proc options group=memory;
run;
Option 1:
/*************** create sample data ************************/
/* file 1 */
data dec;
input ID Date :mmddyy10. Decile;
format Date mmddyy10.;
datalines;
1 1/1/2017 1
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;
/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge;
input Date :mmddyy10. Decile Zipcode $5.;
format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;
/* file 3 */
data no_dec;
input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;
/************ draw control *************************************/
/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
dcl hash h1(ordered:'y', multidata:'y');
h1.defineKey('Zipcode','ran_no');
h1.defineData('id','zipcode','ran_no');
h1.defineDone();
call streaminit(10);
do until(_last);
set no_dec end=_last;
ran_no=rand('uniform');
_rc=h1.add();
end;
_rc=h1.output(dataset:'no_dec_ranno');
stop;
run;
data control(keep=id_dec id zipcode date Decile select_cnt)
control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
;
length id_dec 8;
if _n_=1 then
do;
call streaminit(10);
/* load no_dec_ranno into hash replacing the random values by a counter */
/* - the order is still random but the counter instead of a random value will allow to address specific items later on */
/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */
dcl hash h_nodec(ordered:'y');
h_nodec.defineKey('Zipcode','seq_no');
h_nodec.defineData('id','zipcode','seq_no');
h_nodec.defineDone();
do until(_last);
set no_dec_ranno(drop=ran_no) end=_last;
by Zipcode;
if first.zipcode then seq_no=1;
else seq_no+1;
_rc=h_nodec.add();
end;
/* load the bridge data into a hash */
dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
dcl hash h_last_ranno();
h_last_ranno.defineKey('zipcode');
h_last_ranno.defineData('seq_no');
h_last_ranno.defineDone();
/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */
dcl hash h_zipcollect();
h_zipcollect.defineKey('iter');
h_zipcollect.defineData('zipcode');
h_zipcollect.defineDone();
end;
call missing(of _all_);
set dec(rename=(id=id_dec));
/*** draw two controls for case ***/
select_cnt=0;
/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */
/* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */
_rc=h_zipcollect.clear();
do iter=1 by 1 while(h_brdg.do_over() = 0);
_rc=h_zipcollect.add();
end;
do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
/* random selection of one of the matching zipcodes from hash h_zipcollect */
iter=rand('integer',1,h_zipcollect.num_items);
if h_zipcollect.find()=0 then
do;
/** 2. draw control **/
/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
if h_last_ranno.find() ne 0 then
do;
seq_no=1;
_rc=h_last_ranno.add();
end;
/* draw control record */
if h_nodec.find()=0 then
do;
/* count how many control records selected for the current record from dec */
select_cnt=sum(select_cnt,1);
output control;
/* remove selected record from hash as we won't select it again */
_rc=h_nodec.remove();
/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
seq_no=sum(seq_no,1);
_rc=h_last_ranno.replace();
end;
else
do;
/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
_rc=h_zipcollect.remove();
end;
end;
end;
if select_cnt<2 then output control_insufficient_data;
run;
/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* select * */
/* from control_insufficient_data; */
/* quit; */
/* title; */
Option 2:
/*************** create sample data ************************/
/* file 1 */
data dec;
input ID Date :mmddyy10. Decile;
format Date mmddyy10.;
datalines;
1 1/1/2017 1
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;
/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge;
input Date :mmddyy10. Decile Zipcode $5.;
format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;
/* file 3 */
data no_dec;
input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;
/************ date prep *************************************/
/* sort dec by date and decile to allow for by group processing */
proc sort data=dec presorted;
by date decile;
run;
/* sort bridge by date and decile to allow for by group processing */
/* - required because bridge table likely to big for loading in total into hash table */
proc sort data=bridge presorted;
by date decile;
run;
proc sql;
create index idx_dt_dec
on bridge (date, decile);
quit;
/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
dcl hash h1(ordered:'y', multidata:'y');
h1.defineKey('Zipcode','ran_no');
h1.defineData('id','zipcode','ran_no');
h1.defineDone();
call streaminit(10);
do until(_last);
set no_dec end=_last;
ran_no=rand('uniform');
_rc=h1.add();
end;
_rc=h1.output(dataset:'no_dec_ranno');
stop;
run;
/************ draw control *************************************/
data control(keep=id_dec id zipcode date Decile select_cnt)
control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
;
length id_dec 8;
if _n_=1 then
do;
call streaminit(10);
/* load no_dec_ranno into hash replacing the random values by a counter */
/* - the order is still random but the counter instead of a random value will allow to address specific items later on */
/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */
dcl hash h_nodec(ordered:'y');
h_nodec.defineKey('Zipcode','seq_no');
h_nodec.defineData('id','zipcode','seq_no');
h_nodec.defineDone();
do until(_last);
set no_dec_ranno(drop=ran_no) end=_last;
by Zipcode;
if first.zipcode then seq_no=1;
else seq_no+1;
_rc=h_nodec.add();
end;
/* create hash for bridge data */
dcl hash h_brdg(dataset:'bridge(obs=0)', multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
dcl hash h_last_ranno();
h_last_ranno.defineKey('zipcode');
h_last_ranno.defineData('seq_no');
h_last_ranno.defineDone();
/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */
dcl hash h_zipcollect();
h_zipcollect.defineKey('iter');
h_zipcollect.defineData('zipcode');
h_zipcollect.defineDone();
end;
call missing(of _all_);
set dec(rename=(id=id_dec));
by date decile;
if first.decile then
do;
/* load bridge data for current date into hash h_brdg */
_rc=h_brdg.delete();
h_brdg = _new_ hash(dataset:catx(' ','bridge(where=(date=',date,'and decile=',decile,'))'), multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* "copy" h_brdg to h_zipcollect that got a sequence key (iter) which allows to directly address items */
_rc=h_zipcollect.clear();
do iter=1 by 1 while(h_brdg.do_over() = 0);
_rc=h_zipcollect.add();
end;
end;
/*** draw two controls per row in dec (file 1) ***/
select_cnt=0;
/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */
do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
/* random selection of one of the zipcodes in hash h_zipcollect */
iter=rand('integer',1,h_zipcollect.num_items);
if h_zipcollect.find()=0 then
do;
/** 2. draw control **/
/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
if h_last_ranno.find() ne 0 then
do;
seq_no=1;
_rc=h_last_ranno.add();
end;
/* draw control record */
if h_nodec.find()=0 then
do;
/* count how many control records selected for the current record from dec */
select_cnt=sum(select_cnt,1);
output control;
/* remove selected record from hash as we won't select it again */
_rc=h_nodec.remove();
/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
seq_no=sum(seq_no,1);
_rc=h_last_ranno.replace();
end;
else
do;
/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
_rc=h_zipcollect.remove();
end;
end;
end;
if select_cnt<2 then output control_insufficient_data;
run;
/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* select * */
/* from control_insufficient_data; */
/* quit; */
/* title; */
You call your file 3 "Non-decedent controls". Does this mean this file doesn't contain any id's of deceased people? Or would we need to identify these via file 1?
When selecting two controls for a deceased case are they then excluded for further cases or can they get selected twice. Basically after the selection process using file 1 with 2.8 million rows do you expect a control file with 5.6 million rows (distinct id's) or could it be less?
@JKHess
Below my attempt to meet your requirements based on my understanding. I believe the code does what it should but I strongly suggest you take some time to create some really good sample data for testing where you know exactly how the desired result needs to look like.
I've chosen for performance reasons a data step/hash table approach. Your data volumes are serious so depending on the memory you've got things might or might not work.
I'm sharing two coding options (variations). Option 1 requires a bit less data prep but in turn might require up to 7GB of memory, option 2 should already work with 4GB of memory.
You can check your memory settings via below (results are in bytes, zero means normally as much as available).
proc options group=memory;
run;
Option 1:
/*************** create sample data ************************/
/* file 1 */
data dec;
input ID Date :mmddyy10. Decile;
format Date mmddyy10.;
datalines;
1 1/1/2017 1
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;
/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge;
input Date :mmddyy10. Decile Zipcode $5.;
format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;
/* file 3 */
data no_dec;
input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;
/************ draw control *************************************/
/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
dcl hash h1(ordered:'y', multidata:'y');
h1.defineKey('Zipcode','ran_no');
h1.defineData('id','zipcode','ran_no');
h1.defineDone();
call streaminit(10);
do until(_last);
set no_dec end=_last;
ran_no=rand('uniform');
_rc=h1.add();
end;
_rc=h1.output(dataset:'no_dec_ranno');
stop;
run;
data control(keep=id_dec id zipcode date Decile select_cnt)
control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
;
length id_dec 8;
if _n_=1 then
do;
call streaminit(10);
/* load no_dec_ranno into hash replacing the random values by a counter */
/* - the order is still random but the counter instead of a random value will allow to address specific items later on */
/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */
dcl hash h_nodec(ordered:'y');
h_nodec.defineKey('Zipcode','seq_no');
h_nodec.defineData('id','zipcode','seq_no');
h_nodec.defineDone();
do until(_last);
set no_dec_ranno(drop=ran_no) end=_last;
by Zipcode;
if first.zipcode then seq_no=1;
else seq_no+1;
_rc=h_nodec.add();
end;
/* load the bridge data into a hash */
dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
dcl hash h_last_ranno();
h_last_ranno.defineKey('zipcode');
h_last_ranno.defineData('seq_no');
h_last_ranno.defineDone();
/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */
dcl hash h_zipcollect();
h_zipcollect.defineKey('iter');
h_zipcollect.defineData('zipcode');
h_zipcollect.defineDone();
end;
call missing(of _all_);
set dec(rename=(id=id_dec));
/*** draw two controls for case ***/
select_cnt=0;
/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */
/* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */
_rc=h_zipcollect.clear();
do iter=1 by 1 while(h_brdg.do_over() = 0);
_rc=h_zipcollect.add();
end;
do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
/* random selection of one of the matching zipcodes from hash h_zipcollect */
iter=rand('integer',1,h_zipcollect.num_items);
if h_zipcollect.find()=0 then
do;
/** 2. draw control **/
/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
if h_last_ranno.find() ne 0 then
do;
seq_no=1;
_rc=h_last_ranno.add();
end;
/* draw control record */
if h_nodec.find()=0 then
do;
/* count how many control records selected for the current record from dec */
select_cnt=sum(select_cnt,1);
output control;
/* remove selected record from hash as we won't select it again */
_rc=h_nodec.remove();
/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
seq_no=sum(seq_no,1);
_rc=h_last_ranno.replace();
end;
else
do;
/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
_rc=h_zipcollect.remove();
end;
end;
end;
if select_cnt<2 then output control_insufficient_data;
run;
/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* select * */
/* from control_insufficient_data; */
/* quit; */
/* title; */
Option 2:
/*************** create sample data ************************/
/* file 1 */
data dec;
input ID Date :mmddyy10. Decile;
format Date mmddyy10.;
datalines;
1 1/1/2017 1
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;
/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge;
input Date :mmddyy10. Decile Zipcode $5.;
format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;
/* file 3 */
data no_dec;
input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;
/************ date prep *************************************/
/* sort dec by date and decile to allow for by group processing */
proc sort data=dec presorted;
by date decile;
run;
/* sort bridge by date and decile to allow for by group processing */
/* - required because bridge table likely to big for loading in total into hash table */
proc sort data=bridge presorted;
by date decile;
run;
proc sql;
create index idx_dt_dec
on bridge (date, decile);
quit;
/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
dcl hash h1(ordered:'y', multidata:'y');
h1.defineKey('Zipcode','ran_no');
h1.defineData('id','zipcode','ran_no');
h1.defineDone();
call streaminit(10);
do until(_last);
set no_dec end=_last;
ran_no=rand('uniform');
_rc=h1.add();
end;
_rc=h1.output(dataset:'no_dec_ranno');
stop;
run;
/************ draw control *************************************/
data control(keep=id_dec id zipcode date Decile select_cnt)
control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
;
length id_dec 8;
if _n_=1 then
do;
call streaminit(10);
/* load no_dec_ranno into hash replacing the random values by a counter */
/* - the order is still random but the counter instead of a random value will allow to address specific items later on */
/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */
dcl hash h_nodec(ordered:'y');
h_nodec.defineKey('Zipcode','seq_no');
h_nodec.defineData('id','zipcode','seq_no');
h_nodec.defineDone();
do until(_last);
set no_dec_ranno(drop=ran_no) end=_last;
by Zipcode;
if first.zipcode then seq_no=1;
else seq_no+1;
_rc=h_nodec.add();
end;
/* create hash for bridge data */
dcl hash h_brdg(dataset:'bridge(obs=0)', multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
dcl hash h_last_ranno();
h_last_ranno.defineKey('zipcode');
h_last_ranno.defineData('seq_no');
h_last_ranno.defineDone();
/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */
dcl hash h_zipcollect();
h_zipcollect.defineKey('iter');
h_zipcollect.defineData('zipcode');
h_zipcollect.defineDone();
end;
call missing(of _all_);
set dec(rename=(id=id_dec));
by date decile;
if first.decile then
do;
/* load bridge data for current date into hash h_brdg */
_rc=h_brdg.delete();
h_brdg = _new_ hash(dataset:catx(' ','bridge(where=(date=',date,'and decile=',decile,'))'), multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* "copy" h_brdg to h_zipcollect that got a sequence key (iter) which allows to directly address items */
_rc=h_zipcollect.clear();
do iter=1 by 1 while(h_brdg.do_over() = 0);
_rc=h_zipcollect.add();
end;
end;
/*** draw two controls per row in dec (file 1) ***/
select_cnt=0;
/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */
do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
/* random selection of one of the zipcodes in hash h_zipcollect */
iter=rand('integer',1,h_zipcollect.num_items);
if h_zipcollect.find()=0 then
do;
/** 2. draw control **/
/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
if h_last_ranno.find() ne 0 then
do;
seq_no=1;
_rc=h_last_ranno.add();
end;
/* draw control record */
if h_nodec.find()=0 then
do;
/* count how many control records selected for the current record from dec */
select_cnt=sum(select_cnt,1);
output control;
/* remove selected record from hash as we won't select it again */
_rc=h_nodec.remove();
/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
seq_no=sum(seq_no,1);
_rc=h_last_ranno.replace();
end;
else
do;
/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
_rc=h_zipcollect.remove();
end;
end;
end;
if select_cnt<2 then output control_insufficient_data;
run;
/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* select * */
/* from control_insufficient_data; */
/* quit; */
/* title; */
@JKHess wrote:
Thank you @Patrick. Let me take a little time to look through these. I'm working in a virtual research environment, and my workspace limit is apparently 5 Tb. I only know this because the help desk alerted me when I exceeded this when running my cartesian product code..
5TB is disk space where your SAS WORK resides and not memory which will be lower. Running the proc options as provided earlier will tell you how much memory you've got.
@JKHess wrote:
@Patrick, I ran option 1 on the full files, after testing it on a smaller subset, and tracing the results through each of the three input files. It worked! And no issues with memory. Thank you so much...grateful for the help.
@JKHess That's great to hear. Out of curiosity: How much time did it take to run this process with your full data volume?
And just one more thought for you to consider:
I assume that it's statistically acceptable for your 2 year observation period that the records in File 3 are a snapshot without date range (=people moving zipcode not covered). @Tom mentioned this already.
But what about the deciles? If they are per zipcode and date then I believe your analysis will be skewed towards rural low population density zipcodes.
@JKHess I believe the more important question that needs answering is:
"But what about the deciles? If they are per zipcode and date then I believe your analysis will be skewed towards rural low population density zipcodes."
The code you're currently using will randomly select a zipcode in scope to draw a control from. I would assume that the number of beneficiaries in your File3 data differs between zipcodes (I assume zipcodes from rural areas have tendentially less rows). Because the random selection of zipcodes is not weighted by number of rows under a zipcode, people living under a zipcode with less entries will have a higher probability to get selected as control (assumed rural has higher probability than urban).
If above is an issue then the code needs amendment.
@JKHess wrote:
...I agree this is a problem. How can I weight by the number of records for each zipcode?
I believe I have an approach how to implement this change with acceptable impact on performance. I'll give it a go once time.
@JKHess wrote:
...I agree this is a problem. How can I weight by the number of records for each zipcode?
Below the amended code that now should choose id's from any of the zipcodes in scope with the same probability.
The hard bit was to come-up with a change for a weighted random selection of a zipcode where performance doesn't degrade too much. I couldn't avoid implementation of some additional looping. Let me know the runtime (and if it's still manageable).
And once again: Please review both the code and validate the result.
/*************** create sample data ************************/
/* file 1 */
data dec;
input ID Date :mmddyy10. Decile;
format Date mmddyy10.;
datalines;
1 1/1/2017 1
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;
/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge;
input Date :mmddyy10. Decile Zipcode $5.;
format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;
/* file 3 */
data no_dec;
input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;
/************ data prep *************************************/
/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
dcl hash h1(ordered:'y', multidata:'y');
h1.defineKey('Zipcode','ran_no');
h1.defineData('id','zipcode','ran_no');
h1.defineDone();
call streaminit(10);
do until(_last);
set no_dec end=_last;
ran_no=rand('uniform');
_rc=h1.add();
end;
_rc=h1.output(dataset:'no_dec_ranno');
stop;
run;
/************ draw control *************************************/
data control(keep=id_dec id zipcode date Decile select_cnt)
control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
;
length id_dec 8;
if _n_=1 then
do;
call streaminit(10);
/* define hash to collect number of rows (items) per zipcode */
/* - used for weighted random selection of zipcode from which to draw control from */
n_items=0;
dcl hash h_nodec_nperzip();
h_nodec_nperzip.defineKey('zipcode');
h_nodec_nperzip.defineData('n_items');
h_nodec_nperzip.defineDone();
/* load no_dec_ranno into hash replacing the random values by a sequence number (by zipcode) */
/* - the order is still random but the sequence number instead of a random value will allow to address specific items later on */
/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */
dcl hash h_nodec(ordered:'y');
h_nodec.defineKey('Zipcode','seq_no');
h_nodec.defineData('id','zipcode','seq_no');
h_nodec.defineDone();
do until(_last);
set no_dec_ranno(drop=ran_no) end=_last;
by Zipcode;
/* populate hash h_nodec */
if first.zipcode then seq_no=1;
else seq_no+1;
_rc=h_nodec.add();
/* populate hash h_nodec_nperzip */
n_items=sum(n_items,1);
if last.zipcode then
do;
_rc=h_nodec_nperzip.add();
n_items=0;
end;
end;
/* load the bridge data into a hash */
dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y');
h_brdg.defineKey('date','decile');
h_brdg.defineData('zipcode');
h_brdg.defineDone();
/* hash to store per zipcode the last sequence number used to populate the table with control data */
/* - to ensure a record gets only drawn once */
dcl hash h_last_ranno();
h_last_ranno.defineKey('zipcode');
h_last_ranno.defineData('seq_no');
h_last_ranno.defineDone();
/* arrays to store zipcode and cumulative sum of number of items under a zipcode */
array a_zipcode{10000} $5 _temporary_;
array a_itemcum{0:10000} 8 _temporary_;
a_itemcum[0]=0;
end;
call missing(of _all_);
set dec(rename=(id=id_dec));
/*** draw two controls for case ***/
/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */
/* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */
_rc=h_brdg.reset_dup();
do _i=1 by 1 while(h_brdg.do_over() = 0);
_rc=h_nodec_nperzip.find()=0;
a_zipcode[_i]=zipcode;
a_itemcum[_i]=sum(a_itemcum[_i-1],n_items);
end;
a_itemcum_n_elements=sum(_i,-1);
select_cnt=0;
do _i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
/** random selection of one of the matching zipcodes from array a_zipcode, weighted by number of items per zipcode **/
/* create random integer in the range of 1 to n zipcodes to choose from */
ran_val=rand('integer',1,a_itemcum[a_itemcum_n_elements]);
/* binary search through array a_itemcum to find the element that stores the higher boundary */
/* - when found use the index of this element to derive the zipcode from which to draw control record */
lbound=1;
hbound=a_itemcum[a_itemcum_n_elements];
do while(lbound <= hbound);
midpt=floor(sum(lbound,hbound)/2);
if a_itemcum[midpt-1] >= ran_val then hbound=midpt-1;
else
if a_itemcum[midpt] < ran_val then lbound=midpt+1;
else
if a_itemcum[midpt-1] < ran_val <= a_itemcum[midpt] then
do;
zipcode=a_zipcode[midpt];
leave;
end;
end;
/** 2. draw control from population under selected zip code **/
/* for the chosen zipcode derive the row with the lowest seq_no that hasn't been drawn previously */
if h_last_ranno.find() ne 0 then
do;
seq_no=1;
_rc=h_last_ranno.add();
end;
/* draw control record */
if h_nodec.find()=0 then
do;
/* count how many control records selected for the current record from dec */
select_cnt=sum(select_cnt,1);
output control;
/* remove selected record from hash as we won't select it again */
_rc=h_nodec.remove();
/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
seq_no=sum(seq_no,1);
_rc=h_last_ranno.replace();
end;
end;
if select_cnt<2 then output control_insufficient_data;
run;
/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* select * */
/* from control_insufficient_data; */
/* quit; */
/* title; */
I do have another design idea how to approach your problem which would likely perform quite a bit better but for which I would need to know more about your data and environment (to not waste memory) to decide if it's feasible. Let's see how above code performs before we go there.
About the address changes:
The dataset had zipcode assigned to each beneficiary annually. About 5% changed zipcode from one calendar year to the next
If you've got annual snapshots then why not run the process for each year separately using the matching yearly snapshot. That shouldn't take that much but return better data which must be in your interest.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.