Solved: How do I randomly select controls from within a large administrative d...

JKHess · Posted 02-02-2025 12:16 AM

I have a large health plan dataset (n~36 million records) that I need to randomly select two non-decedent controls for each decedent case. The matching variable is decile of air pollutant exposure, which was calculated from the full distribution of daily pollutant concentrations for all zip codes occurring in the dataset, across the 2-year study period. I've tried to approach this with 3 files:

File 1 - Decedent cases (n=2.8 million): ID, Date (DeathDt), Decile (on DeathDt)

File 2 - Bridge file: Date, Zipcode, Decile [file contains all possible combinations of date and zipcode in the non-decedent file, with decile assigned based on pollutant concentration on that date for that zipcode]

File 3 - Non-decedent controls (n=34.6 million): ID, Zipcode

The files look like this:

File 1:

ID	Date	Decile
1	1/1/2017	1
22	1/1/2017	1
41	1/1/2017	1
56	1/1/2017	2
79	1/1/2017	2
85	1/1/2017	2
100	1/2/2017	1
118	1/2/2017	1
125	1/2/2017	2
167	1/2/2017	2
178	1/2/2017	3

File 2:

Date	Zipcode	Decile
1/1/2017	12832	1
1/1/2017	03349	1
1/1/2017	04001	2
1/2/2017	56723	2
1/2/2017	88123	1
1/3/2017	80010	3
1/3/2017	96224	3

File 3:

ID	Zipcode
2	88123
3	12345
4	03304
5	03867
6	04945
7	04001
8	98765
9	98801
10	96224
11	00001
12	83356

Again, for each decedent case, I need to randomly select 2 non-decedent controls who were alive as of the case's death date, but this can't be done directly because there is no date in the non-decedent file. I created a cartesian product joining Files 1 and 2, which produced a file named 'combined', and gave me the full range of zipcodes that are in the same decile as the case, on the case's date of death (code below). My plan was to join this file with the non-decedents file by zipcode, and then select 2 controls randomly. But at this point, the files were huge, and I ran out of space...and I don't know really how to do the random selection from here. I'm thinking there has to be a more efficient way to do this. Sorry for the complicated description. Would appreciate any advice for how to do this..

proc sql;

create table combined as

select f1.decile as CaseDecile, f1.date as CaseDate

f2.decile as BridgeDecile, f2.date as BridgeDate, f2.zipcode as BridgeZipcode

from decedents as f1, bridge as f2

where f1.decile = f2.decile AND f1.date = f2.date;

quit;

Patrick · Posted 02-03-2025 11:54 AM

@JKHess
Below my attempt to meet your requirements based on my understanding. I believe the code does what it should but I strongly suggest you take some time to create some really good sample data for testing where you know exactly how the desired result needs to look like.

I've chosen for performance reasons a data step/hash table approach. Your data volumes are serious so depending on the memory you've got things might or might not work.

I'm sharing two coding options (variations). Option 1 requires a bit less data prep but in turn might require up to 7GB of memory, option 2 should already work with 4GB of memory.

You can check your memory settings via below (results are in bytes, zero means normally as much as available).

proc options group=memory;
run;

Option 1:

Spoiler

/*************** create sample data ************************/
/* file 1 */
data dec; 
  input ID Date :mmddyy10. Decile;
  format Date mmddyy10.;
datalines;
1 1/1/2017 1 
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;

/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge; 
  input Date :mmddyy10.  Decile Zipcode $5.;
  format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;

/* file 3 */
data no_dec; 
  input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;

/************  draw control *************************************/

/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
	dcl hash h1(ordered:'y', multidata:'y');
	h1.defineKey('Zipcode','ran_no');
	h1.defineData('id','zipcode','ran_no');
	h1.defineDone();
	call streaminit(10);
	do until(_last);
		set no_dec end=_last;
		ran_no=rand('uniform');
		_rc=h1.add();
	end;
	_rc=h1.output(dataset:'no_dec_ranno');
	stop;
run;

data control(keep=id_dec id zipcode date Decile select_cnt)
		 control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
		 ;
	length id_dec 8;
	if _n_=1 then
		do;
			call streaminit(10);
			/* load no_dec_ranno into hash replacing the random values by a counter 																							 		*/
			/* - the order is still random but the counter instead of a random value will allow to address specific items later on 		*/
			/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB	*/
			dcl hash h_nodec(ordered:'y');
			h_nodec.defineKey('Zipcode','seq_no');
			h_nodec.defineData('id','zipcode','seq_no');
			h_nodec.defineDone();
			do until(_last);
				set no_dec_ranno(drop=ran_no) end=_last;
				by Zipcode;
				if first.zipcode then seq_no=1;
				else seq_no+1;
				_rc=h_nodec.add();
			end;
			/* load the bridge data into a hash */
			dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y');
			h_brdg.defineKey('date','decile');
			h_brdg.defineData('zipcode');
			h_brdg.defineDone();
			/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
			dcl hash h_last_ranno();
			h_last_ranno.defineKey('zipcode');
			h_last_ranno.defineData('seq_no');
			h_last_ranno.defineDone();
			/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ 
			dcl hash h_zipcollect();
			h_zipcollect.defineKey('iter');
			h_zipcollect.defineData('zipcode');
			h_zipcollect.defineDone();
			
		end;
	call missing(of _all_);
	
	set dec(rename=(id=id_dec));
	
	/*** draw two controls for case ***/
	select_cnt=0;
	/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ 
	/* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */
	_rc=h_zipcollect.clear();
	do iter=1 by 1 while(h_brdg.do_over() = 0);
		_rc=h_zipcollect.add();
	end;
	
	do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
		/* random selection of one of the matching zipcodes from hash h_zipcollect */
		iter=rand('integer',1,h_zipcollect.num_items);
		
		if h_zipcollect.find()=0 then
			do;
				/** 2. draw control **/
				/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
				if h_last_ranno.find() ne 0 then
					do;
						seq_no=1;
						_rc=h_last_ranno.add();
					end;
				/* draw control record */	
				if h_nodec.find()=0 then
					do;
						/* count how many control records selected for the current record from dec */
						select_cnt=sum(select_cnt,1);
						output control;
						/* remove selected record from hash as we won't select it again */
						_rc=h_nodec.remove();
						/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
						seq_no=sum(seq_no,1);
						_rc=h_last_ranno.replace();
					end;
				else
					do;
						/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
						_rc=h_zipcollect.remove();					
					end;
			end;
	end;
	
	if select_cnt<2 then output control_insufficient_data;
run;

/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* 	select * */
/* 	from control_insufficient_data; */
/* quit; */
/* title; */

/*************** create sample data ************************/ /* file 1 */ data dec; input ID Date :mmddyy10. Decile; format Date mmddyy10.; datalines; 1 1/1/2017 1 22 1/1/2017 1 41 1/1/2017 1 56 1/1/2017 2 79 1/1/2017 2 85 1/1/2017 2 100 1/2/2017 1 118 1/2/2017 1 125 1/2/2017 2 167 1/2/2017 2 178 1/2/2017 3 ; run; /* file 2 - not really a bridge because relationship bridge:no_dec is many:many */ data bridge; input Date :mmddyy10. Decile Zipcode $5.; format Date mmddyy10.; datalines; 1/1/2017 1 88123 1/1/2017 1 03867 1/1/2017 1 04001 1/1/2017 2 03304 1/1/2017 2 98765 1/1/2017 2 96224 1/1/2017 2 00001 1/2/2017 1 98801 1/2/2017 2 88123 1/2/2017 2 12345 1/2/2017 2 83356 1/2/2017 2 98765 1/2/2017 3 03304 1/2/2017 3 04945 ; run; /* file 3 */ data no_dec; input ID Zipcode $5.; datalines; 2 88123 21 88123 22 88123 23 88123 24 88123 3 12345 4 03304 5 03867 6 04945 7 04001 8 98765 9 98801 10 96224 11 00001 12 83356 13 83356 ; run; /************ draw control *************************************/ /* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */ data _null_; dcl hash h1(ordered:'y', multidata:'y'); h1.defineKey('Zipcode','ran_no'); h1.defineData('id','zipcode','ran_no'); h1.defineDone(); call streaminit(10); do until(_last); set no_dec end=_last; ran_no=rand('uniform'); _rc=h1.add(); end; _rc=h1.output(dataset:'no_dec_ranno'); stop; run; data control(keep=id_dec id zipcode date Decile select_cnt) control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt) ; length id_dec 8; if _n_=1 then do; call streaminit(10); /* load no_dec_ranno into hash replacing the random values by a counter */ /* - the order is still random but the counter instead of a random value will allow to address specific items later on */ /* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */ dcl hash h_nodec(ordered:'y'); h_nodec.defineKey('Zipcode','seq_no'); h_nodec.defineData('id','zipcode','seq_no'); h_nodec.defineDone(); do until(_last); set no_dec_ranno(drop=ran_no) end=_last; by Zipcode; if first.zipcode then seq_no=1; else seq_no+1; _rc=h_nodec.add(); end; /* load the bridge data into a hash */ dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */ dcl hash h_last_ranno(); h_last_ranno.defineKey('zipcode'); h_last_ranno.defineData('seq_no'); h_last_ranno.defineDone(); /* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ dcl hash h_zipcollect(); h_zipcollect.defineKey('iter'); h_zipcollect.defineData('zipcode'); h_zipcollect.defineDone(); end; call missing(of _all_); set dec(rename=(id=id_dec)); /*** draw two controls for case ***/ select_cnt=0; /** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ /* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */ _rc=h_zipcollect.clear(); do iter=1 by 1 while(h_brdg.do_over() = 0); _rc=h_zipcollect.add(); end; do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */ /* random selection of one of the matching zipcodes from hash h_zipcollect */ iter=rand('integer',1,h_zipcollect.num_items); if h_zipcollect.find()=0 then do; /** 2. draw control **/ /* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */ if h_last_ranno.find() ne 0 then do; seq_no=1; _rc=h_last_ranno.add(); end; /* draw control record */ if h_nodec.find()=0 then do; /* count how many control records selected for the current record from dec */ select_cnt=sum(select_cnt,1); output control; /* remove selected record from hash as we won't select it again */ _rc=h_nodec.remove(); /* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */ seq_no=sum(seq_no,1); _rc=h_last_ranno.replace(); end; else do; /* in case all rows for a zipcode have already been drawn don't use this zipcode any further */ _rc=h_zipcollect.remove(); end; end; end; if select_cnt<2 then output control_insufficient_data; run; /* title 'control'; */ /* proc print data=control; */ /* run; */ /* title 'Decedent with insufficient matching data to create control'; */ /* proc sql; */ /* select * */ /* from control_insufficient_data; */ /* quit; */ /* title; */

Option 2:

Spoiler

/*************** create sample data ************************/
/* file 1 */
data dec; 
  input ID Date :mmddyy10. Decile;
  format Date mmddyy10.;
datalines;
1 1/1/2017 1 
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;

/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge; 
  input Date :mmddyy10.  Decile Zipcode $5.;
  format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;

/* file 3 */
data no_dec; 
  input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;

/************  date prep *************************************/

/* sort dec by date and decile to allow for by group processing */
proc sort data=dec presorted;
	by date decile;
run;

/* sort bridge by date and decile to allow for by group processing 															*/
/* - required because bridge table likely to big for loading in total into hash table */
proc sort data=bridge presorted;
	by date decile;
run;
proc sql;
	create index idx_dt_dec
	on bridge (date, decile);
quit;

/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
	dcl hash h1(ordered:'y', multidata:'y');
	h1.defineKey('Zipcode','ran_no');
	h1.defineData('id','zipcode','ran_no');
	h1.defineDone();
	call streaminit(10);
	do until(_last);
		set no_dec end=_last;
		ran_no=rand('uniform');
		_rc=h1.add();
	end;
	_rc=h1.output(dataset:'no_dec_ranno');
	stop;
run;

/************  draw control *************************************/
data control(keep=id_dec id zipcode date Decile select_cnt)
		 control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
		 ;
	length id_dec 8;
	if _n_=1 then
		do;
			call streaminit(10);
			/* load no_dec_ranno into hash replacing the random values by a counter 																							 		*/
			/* - the order is still random but the counter instead of a random value will allow to address specific items later on 		*/
			/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB	*/
			dcl hash h_nodec(ordered:'y');
			h_nodec.defineKey('Zipcode','seq_no');
			h_nodec.defineData('id','zipcode','seq_no');
			h_nodec.defineDone();
			do until(_last);
				set no_dec_ranno(drop=ran_no) end=_last;
				by Zipcode;
				if first.zipcode then seq_no=1;
				else seq_no+1;
				_rc=h_nodec.add();
			end;
			/* create hash for bridge data */
			dcl hash h_brdg(dataset:'bridge(obs=0)', multidata:'y', ordered:'y');
			h_brdg.defineKey('date','decile');
			h_brdg.defineData('zipcode');
			h_brdg.defineDone();
			/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
			dcl hash h_last_ranno();
			h_last_ranno.defineKey('zipcode');
			h_last_ranno.defineData('seq_no');
			h_last_ranno.defineDone();
			/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ 
			dcl hash h_zipcollect();
			h_zipcollect.defineKey('iter');
			h_zipcollect.defineData('zipcode');
			h_zipcollect.defineDone();
		end;
	call missing(of _all_);
	
	set dec(rename=(id=id_dec));
	by date decile;
	
	if first.decile then
		do;
			/* load bridge data for current date into hash h_brdg */
			_rc=h_brdg.delete();
			h_brdg = _new_ hash(dataset:catx(' ','bridge(where=(date=',date,'and decile=',decile,'))'), multidata:'y', ordered:'y');
			h_brdg.defineKey('date','decile');
			h_brdg.defineData('zipcode');
			h_brdg.defineDone();

			/* "copy" h_brdg to h_zipcollect that got a sequence key (iter) which allows to directly address items */
			_rc=h_zipcollect.clear();
			do iter=1 by 1 while(h_brdg.do_over() = 0);
				_rc=h_zipcollect.add();
			end;
		end;
	
	/*** draw two controls per row in dec (file 1) ***/
	select_cnt=0;
	/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ 
	do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
		/* random selection of one of the zipcodes in hash h_zipcollect */
		iter=rand('integer',1,h_zipcollect.num_items);
		
		if h_zipcollect.find()=0 then
			do;
				/** 2. draw control **/
				/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
				if h_last_ranno.find() ne 0 then
					do;
						seq_no=1;
						_rc=h_last_ranno.add();
					end;
				/* draw control record */	
				if h_nodec.find()=0 then
					do;
						/* count how many control records selected for the current record from dec */
						select_cnt=sum(select_cnt,1);
						output control;
						/* remove selected record from hash as we won't select it again */
						_rc=h_nodec.remove();
						/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
						seq_no=sum(seq_no,1);
						_rc=h_last_ranno.replace();
					end;
				else
					do;
						/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
						_rc=h_zipcollect.remove();					
					end;
			end;
	end;
	
	if select_cnt<2 then output control_insufficient_data;
run;

/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* 	select * */
/* 	from control_insufficient_data; */
/* quit; */
/* title; */

/*************** create sample data ************************/ /* file 1 */ data dec; input ID Date :mmddyy10. Decile; format Date mmddyy10.; datalines; 1 1/1/2017 1 22 1/1/2017 1 41 1/1/2017 1 56 1/1/2017 2 79 1/1/2017 2 85 1/1/2017 2 100 1/2/2017 1 118 1/2/2017 1 125 1/2/2017 2 167 1/2/2017 2 178 1/2/2017 3 ; run; /* file 2 - not really a bridge because relationship bridge:no_dec is many:many */ data bridge; input Date :mmddyy10. Decile Zipcode $5.; format Date mmddyy10.; datalines; 1/1/2017 1 88123 1/1/2017 1 03867 1/1/2017 1 04001 1/1/2017 2 03304 1/1/2017 2 98765 1/1/2017 2 96224 1/1/2017 2 00001 1/2/2017 1 98801 1/2/2017 2 88123 1/2/2017 2 12345 1/2/2017 2 83356 1/2/2017 2 98765 1/2/2017 3 03304 1/2/2017 3 04945 ; run; /* file 3 */ data no_dec; input ID Zipcode $5.; datalines; 2 88123 21 88123 22 88123 23 88123 24 88123 3 12345 4 03304 5 03867 6 04945 7 04001 8 98765 9 98801 10 96224 11 00001 12 83356 13 83356 ; run; /************ date prep *************************************/ /* sort dec by date and decile to allow for by group processing */ proc sort data=dec presorted; by date decile; run; /* sort bridge by date and decile to allow for by group processing */ /* - required because bridge table likely to big for loading in total into hash table */ proc sort data=bridge presorted; by date decile; run; proc sql; create index idx_dt_dec on bridge (date, decile); quit; /* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */ data _null_; dcl hash h1(ordered:'y', multidata:'y'); h1.defineKey('Zipcode','ran_no'); h1.defineData('id','zipcode','ran_no'); h1.defineDone(); call streaminit(10); do until(_last); set no_dec end=_last; ran_no=rand('uniform'); _rc=h1.add(); end; _rc=h1.output(dataset:'no_dec_ranno'); stop; run; /************ draw control *************************************/ data control(keep=id_dec id zipcode date Decile select_cnt) control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt) ; length id_dec 8; if _n_=1 then do; call streaminit(10); /* load no_dec_ranno into hash replacing the random values by a counter */ /* - the order is still random but the counter instead of a random value will allow to address specific items later on */ /* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */ dcl hash h_nodec(ordered:'y'); h_nodec.defineKey('Zipcode','seq_no'); h_nodec.defineData('id','zipcode','seq_no'); h_nodec.defineDone(); do until(_last); set no_dec_ranno(drop=ran_no) end=_last; by Zipcode; if first.zipcode then seq_no=1; else seq_no+1; _rc=h_nodec.add(); end; /* create hash for bridge data */ dcl hash h_brdg(dataset:'bridge(obs=0)', multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */ dcl hash h_last_ranno(); h_last_ranno.defineKey('zipcode'); h_last_ranno.defineData('seq_no'); h_last_ranno.defineDone(); /* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ dcl hash h_zipcollect(); h_zipcollect.defineKey('iter'); h_zipcollect.defineData('zipcode'); h_zipcollect.defineDone(); end; call missing(of _all_); set dec(rename=(id=id_dec)); by date decile; if first.decile then do; /* load bridge data for current date into hash h_brdg */ _rc=h_brdg.delete(); h_brdg = _new_ hash(dataset:catx(' ','bridge(where=(date=',date,'and decile=',decile,'))'), multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* "copy" h_brdg to h_zipcollect that got a sequence key (iter) which allows to directly address items */ _rc=h_zipcollect.clear(); do iter=1 by 1 while(h_brdg.do_over() = 0); _rc=h_zipcollect.add(); end; end; /*** draw two controls per row in dec (file 1) ***/ select_cnt=0; /** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */ /* random selection of one of the zipcodes in hash h_zipcollect */ iter=rand('integer',1,h_zipcollect.num_items); if h_zipcollect.find()=0 then do; /** 2. draw control **/ /* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */ if h_last_ranno.find() ne 0 then do; seq_no=1; _rc=h_last_ranno.add(); end; /* draw control record */ if h_nodec.find()=0 then do; /* count how many control records selected for the current record from dec */ select_cnt=sum(select_cnt,1); output control; /* remove selected record from hash as we won't select it again */ _rc=h_nodec.remove(); /* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */ seq_no=sum(seq_no,1); _rc=h_last_ranno.replace(); end; else do; /* in case all rows for a zipcode have already been drawn don't use this zipcode any further */ _rc=h_zipcollect.remove(); end; end; end; if select_cnt<2 then output control_insufficient_data; run; /* title 'control'; */ /* proc print data=control; */ /* run; */ /* title 'Decedent with insufficient matching data to create control'; */ /* proc sql; */ /* select * */ /* from control_insufficient_data; */ /* quit; */ /* title; */

View solution in original post

Patrick · Posted 02-02-2025 03:52 AM

You call your file 3 "Non-decedent controls". Does this mean this file doesn't contain any id's of deceased people? Or would we need to identify these via file 1?

When selecting two controls for a deceased case are they then excluded for further cases or can they get selected twice. Basically after the selection process using file 1 with 2.8 million rows do you expect a control file with 5.6 million rows (distinct id's) or could it be less?

JKHess · Posted 02-02-2025 09:02 AM

Hi @Patrick, the decedents file has all beneficiaries who died during the two year period, and non-decedents are the rest. I removed the overlap caused by having two separate years of data. So decedents cannot be selected as controls prior to their death date, and non-decedents can only be selected once - they are not put back into the pool of controls once selected.

Patrick · Posted 02-03-2025 11:54 AM

@JKHess
Below my attempt to meet your requirements based on my understanding. I believe the code does what it should but I strongly suggest you take some time to create some really good sample data for testing where you know exactly how the desired result needs to look like.

I've chosen for performance reasons a data step/hash table approach. Your data volumes are serious so depending on the memory you've got things might or might not work.

I'm sharing two coding options (variations). Option 1 requires a bit less data prep but in turn might require up to 7GB of memory, option 2 should already work with 4GB of memory.

You can check your memory settings via below (results are in bytes, zero means normally as much as available).

proc options group=memory;
run;

Option 1:

Spoiler

/*************** create sample data ************************/
/* file 1 */
data dec; 
  input ID Date :mmddyy10. Decile;
  format Date mmddyy10.;
datalines;
1 1/1/2017 1 
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;

/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge; 
  input Date :mmddyy10.  Decile Zipcode $5.;
  format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;

/* file 3 */
data no_dec; 
  input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;

/************  draw control *************************************/

/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
	dcl hash h1(ordered:'y', multidata:'y');
	h1.defineKey('Zipcode','ran_no');
	h1.defineData('id','zipcode','ran_no');
	h1.defineDone();
	call streaminit(10);
	do until(_last);
		set no_dec end=_last;
		ran_no=rand('uniform');
		_rc=h1.add();
	end;
	_rc=h1.output(dataset:'no_dec_ranno');
	stop;
run;

data control(keep=id_dec id zipcode date Decile select_cnt)
		 control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
		 ;
	length id_dec 8;
	if _n_=1 then
		do;
			call streaminit(10);
			/* load no_dec_ranno into hash replacing the random values by a counter 																							 		*/
			/* - the order is still random but the counter instead of a random value will allow to address specific items later on 		*/
			/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB	*/
			dcl hash h_nodec(ordered:'y');
			h_nodec.defineKey('Zipcode','seq_no');
			h_nodec.defineData('id','zipcode','seq_no');
			h_nodec.defineDone();
			do until(_last);
				set no_dec_ranno(drop=ran_no) end=_last;
				by Zipcode;
				if first.zipcode then seq_no=1;
				else seq_no+1;
				_rc=h_nodec.add();
			end;
			/* load the bridge data into a hash */
			dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y');
			h_brdg.defineKey('date','decile');
			h_brdg.defineData('zipcode');
			h_brdg.defineDone();
			/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
			dcl hash h_last_ranno();
			h_last_ranno.defineKey('zipcode');
			h_last_ranno.defineData('seq_no');
			h_last_ranno.defineDone();
			/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ 
			dcl hash h_zipcollect();
			h_zipcollect.defineKey('iter');
			h_zipcollect.defineData('zipcode');
			h_zipcollect.defineDone();
			
		end;
	call missing(of _all_);
	
	set dec(rename=(id=id_dec));
	
	/*** draw two controls for case ***/
	select_cnt=0;
	/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ 
	/* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */
	_rc=h_zipcollect.clear();
	do iter=1 by 1 while(h_brdg.do_over() = 0);
		_rc=h_zipcollect.add();
	end;
	
	do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
		/* random selection of one of the matching zipcodes from hash h_zipcollect */
		iter=rand('integer',1,h_zipcollect.num_items);
		
		if h_zipcollect.find()=0 then
			do;
				/** 2. draw control **/
				/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
				if h_last_ranno.find() ne 0 then
					do;
						seq_no=1;
						_rc=h_last_ranno.add();
					end;
				/* draw control record */	
				if h_nodec.find()=0 then
					do;
						/* count how many control records selected for the current record from dec */
						select_cnt=sum(select_cnt,1);
						output control;
						/* remove selected record from hash as we won't select it again */
						_rc=h_nodec.remove();
						/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
						seq_no=sum(seq_no,1);
						_rc=h_last_ranno.replace();
					end;
				else
					do;
						/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
						_rc=h_zipcollect.remove();					
					end;
			end;
	end;
	
	if select_cnt<2 then output control_insufficient_data;
run;

/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* 	select * */
/* 	from control_insufficient_data; */
/* quit; */
/* title; */

/*************** create sample data ************************/ /* file 1 */ data dec; input ID Date :mmddyy10. Decile; format Date mmddyy10.; datalines; 1 1/1/2017 1 22 1/1/2017 1 41 1/1/2017 1 56 1/1/2017 2 79 1/1/2017 2 85 1/1/2017 2 100 1/2/2017 1 118 1/2/2017 1 125 1/2/2017 2 167 1/2/2017 2 178 1/2/2017 3 ; run; /* file 2 - not really a bridge because relationship bridge:no_dec is many:many */ data bridge; input Date :mmddyy10. Decile Zipcode $5.; format Date mmddyy10.; datalines; 1/1/2017 1 88123 1/1/2017 1 03867 1/1/2017 1 04001 1/1/2017 2 03304 1/1/2017 2 98765 1/1/2017 2 96224 1/1/2017 2 00001 1/2/2017 1 98801 1/2/2017 2 88123 1/2/2017 2 12345 1/2/2017 2 83356 1/2/2017 2 98765 1/2/2017 3 03304 1/2/2017 3 04945 ; run; /* file 3 */ data no_dec; input ID Zipcode $5.; datalines; 2 88123 21 88123 22 88123 23 88123 24 88123 3 12345 4 03304 5 03867 6 04945 7 04001 8 98765 9 98801 10 96224 11 00001 12 83356 13 83356 ; run; /************ draw control *************************************/ /* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */ data _null_; dcl hash h1(ordered:'y', multidata:'y'); h1.defineKey('Zipcode','ran_no'); h1.defineData('id','zipcode','ran_no'); h1.defineDone(); call streaminit(10); do until(_last); set no_dec end=_last; ran_no=rand('uniform'); _rc=h1.add(); end; _rc=h1.output(dataset:'no_dec_ranno'); stop; run; data control(keep=id_dec id zipcode date Decile select_cnt) control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt) ; length id_dec 8; if _n_=1 then do; call streaminit(10); /* load no_dec_ranno into hash replacing the random values by a counter */ /* - the order is still random but the counter instead of a random value will allow to address specific items later on */ /* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */ dcl hash h_nodec(ordered:'y'); h_nodec.defineKey('Zipcode','seq_no'); h_nodec.defineData('id','zipcode','seq_no'); h_nodec.defineDone(); do until(_last); set no_dec_ranno(drop=ran_no) end=_last; by Zipcode; if first.zipcode then seq_no=1; else seq_no+1; _rc=h_nodec.add(); end; /* load the bridge data into a hash */ dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */ dcl hash h_last_ranno(); h_last_ranno.defineKey('zipcode'); h_last_ranno.defineData('seq_no'); h_last_ranno.defineDone(); /* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ dcl hash h_zipcollect(); h_zipcollect.defineKey('iter'); h_zipcollect.defineData('zipcode'); h_zipcollect.defineDone(); end; call missing(of _all_); set dec(rename=(id=id_dec)); /*** draw two controls for case ***/ select_cnt=0; /** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ /* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */ _rc=h_zipcollect.clear(); do iter=1 by 1 while(h_brdg.do_over() = 0); _rc=h_zipcollect.add(); end; do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */ /* random selection of one of the matching zipcodes from hash h_zipcollect */ iter=rand('integer',1,h_zipcollect.num_items); if h_zipcollect.find()=0 then do; /** 2. draw control **/ /* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */ if h_last_ranno.find() ne 0 then do; seq_no=1; _rc=h_last_ranno.add(); end; /* draw control record */ if h_nodec.find()=0 then do; /* count how many control records selected for the current record from dec */ select_cnt=sum(select_cnt,1); output control; /* remove selected record from hash as we won't select it again */ _rc=h_nodec.remove(); /* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */ seq_no=sum(seq_no,1); _rc=h_last_ranno.replace(); end; else do; /* in case all rows for a zipcode have already been drawn don't use this zipcode any further */ _rc=h_zipcollect.remove(); end; end; end; if select_cnt<2 then output control_insufficient_data; run; /* title 'control'; */ /* proc print data=control; */ /* run; */ /* title 'Decedent with insufficient matching data to create control'; */ /* proc sql; */ /* select * */ /* from control_insufficient_data; */ /* quit; */ /* title; */

Option 2:

Spoiler

/*************** create sample data ************************/
/* file 1 */
data dec; 
  input ID Date :mmddyy10. Decile;
  format Date mmddyy10.;
datalines;
1 1/1/2017 1 
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;

/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge; 
  input Date :mmddyy10.  Decile Zipcode $5.;
  format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;

/* file 3 */
data no_dec; 
  input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;

/************  date prep *************************************/

/* sort dec by date and decile to allow for by group processing */
proc sort data=dec presorted;
	by date decile;
run;

/* sort bridge by date and decile to allow for by group processing 															*/
/* - required because bridge table likely to big for loading in total into hash table */
proc sort data=bridge presorted;
	by date decile;
run;
proc sql;
	create index idx_dt_dec
	on bridge (date, decile);
quit;

/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
	dcl hash h1(ordered:'y', multidata:'y');
	h1.defineKey('Zipcode','ran_no');
	h1.defineData('id','zipcode','ran_no');
	h1.defineDone();
	call streaminit(10);
	do until(_last);
		set no_dec end=_last;
		ran_no=rand('uniform');
		_rc=h1.add();
	end;
	_rc=h1.output(dataset:'no_dec_ranno');
	stop;
run;

/************  draw control *************************************/
data control(keep=id_dec id zipcode date Decile select_cnt)
		 control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
		 ;
	length id_dec 8;
	if _n_=1 then
		do;
			call streaminit(10);
			/* load no_dec_ranno into hash replacing the random values by a counter 																							 		*/
			/* - the order is still random but the counter instead of a random value will allow to address specific items later on 		*/
			/* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB	*/
			dcl hash h_nodec(ordered:'y');
			h_nodec.defineKey('Zipcode','seq_no');
			h_nodec.defineData('id','zipcode','seq_no');
			h_nodec.defineDone();
			do until(_last);
				set no_dec_ranno(drop=ran_no) end=_last;
				by Zipcode;
				if first.zipcode then seq_no=1;
				else seq_no+1;
				_rc=h_nodec.add();
			end;
			/* create hash for bridge data */
			dcl hash h_brdg(dataset:'bridge(obs=0)', multidata:'y', ordered:'y');
			h_brdg.defineKey('date','decile');
			h_brdg.defineData('zipcode');
			h_brdg.defineDone();
			/* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */
			dcl hash h_last_ranno();
			h_last_ranno.defineKey('zipcode');
			h_last_ranno.defineData('seq_no');
			h_last_ranno.defineDone();
			/* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ 
			dcl hash h_zipcollect();
			h_zipcollect.defineKey('iter');
			h_zipcollect.defineData('zipcode');
			h_zipcollect.defineDone();
		end;
	call missing(of _all_);
	
	set dec(rename=(id=id_dec));
	by date decile;
	
	if first.decile then
		do;
			/* load bridge data for current date into hash h_brdg */
			_rc=h_brdg.delete();
			h_brdg = _new_ hash(dataset:catx(' ','bridge(where=(date=',date,'and decile=',decile,'))'), multidata:'y', ordered:'y');
			h_brdg.defineKey('date','decile');
			h_brdg.defineData('zipcode');
			h_brdg.defineDone();

			/* "copy" h_brdg to h_zipcollect that got a sequence key (iter) which allows to directly address items */
			_rc=h_zipcollect.clear();
			do iter=1 by 1 while(h_brdg.do_over() = 0);
				_rc=h_zipcollect.add();
			end;
		end;
	
	/*** draw two controls per row in dec (file 1) ***/
	select_cnt=0;
	/** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ 
	do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
		/* random selection of one of the zipcodes in hash h_zipcollect */
		iter=rand('integer',1,h_zipcollect.num_items);
		
		if h_zipcollect.find()=0 then
			do;
				/** 2. draw control **/
				/* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */
				if h_last_ranno.find() ne 0 then
					do;
						seq_no=1;
						_rc=h_last_ranno.add();
					end;
				/* draw control record */	
				if h_nodec.find()=0 then
					do;
						/* count how many control records selected for the current record from dec */
						select_cnt=sum(select_cnt,1);
						output control;
						/* remove selected record from hash as we won't select it again */
						_rc=h_nodec.remove();
						/* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
						seq_no=sum(seq_no,1);
						_rc=h_last_ranno.replace();
					end;
				else
					do;
						/* in case all rows for a zipcode have already been drawn don't use this zipcode any further */
						_rc=h_zipcollect.remove();					
					end;
			end;
	end;
	
	if select_cnt<2 then output control_insufficient_data;
run;

/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/* 	select * */
/* 	from control_insufficient_data; */
/* quit; */
/* title; */

/*************** create sample data ************************/ /* file 1 */ data dec; input ID Date :mmddyy10. Decile; format Date mmddyy10.; datalines; 1 1/1/2017 1 22 1/1/2017 1 41 1/1/2017 1 56 1/1/2017 2 79 1/1/2017 2 85 1/1/2017 2 100 1/2/2017 1 118 1/2/2017 1 125 1/2/2017 2 167 1/2/2017 2 178 1/2/2017 3 ; run; /* file 2 - not really a bridge because relationship bridge:no_dec is many:many */ data bridge; input Date :mmddyy10. Decile Zipcode $5.; format Date mmddyy10.; datalines; 1/1/2017 1 88123 1/1/2017 1 03867 1/1/2017 1 04001 1/1/2017 2 03304 1/1/2017 2 98765 1/1/2017 2 96224 1/1/2017 2 00001 1/2/2017 1 98801 1/2/2017 2 88123 1/2/2017 2 12345 1/2/2017 2 83356 1/2/2017 2 98765 1/2/2017 3 03304 1/2/2017 3 04945 ; run; /* file 3 */ data no_dec; input ID Zipcode $5.; datalines; 2 88123 21 88123 22 88123 23 88123 24 88123 3 12345 4 03304 5 03867 6 04945 7 04001 8 98765 9 98801 10 96224 11 00001 12 83356 13 83356 ; run; /************ date prep *************************************/ /* sort dec by date and decile to allow for by group processing */ proc sort data=dec presorted; by date decile; run; /* sort bridge by date and decile to allow for by group processing */ /* - required because bridge table likely to big for loading in total into hash table */ proc sort data=bridge presorted; by date decile; run; proc sql; create index idx_dt_dec on bridge (date, decile); quit; /* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */ data _null_; dcl hash h1(ordered:'y', multidata:'y'); h1.defineKey('Zipcode','ran_no'); h1.defineData('id','zipcode','ran_no'); h1.defineDone(); call streaminit(10); do until(_last); set no_dec end=_last; ran_no=rand('uniform'); _rc=h1.add(); end; _rc=h1.output(dataset:'no_dec_ranno'); stop; run; /************ draw control *************************************/ data control(keep=id_dec id zipcode date Decile select_cnt) control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt) ; length id_dec 8; if _n_=1 then do; call streaminit(10); /* load no_dec_ranno into hash replacing the random values by a counter */ /* - the order is still random but the counter instead of a random value will allow to address specific items later on */ /* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */ dcl hash h_nodec(ordered:'y'); h_nodec.defineKey('Zipcode','seq_no'); h_nodec.defineData('id','zipcode','seq_no'); h_nodec.defineDone(); do until(_last); set no_dec_ranno(drop=ran_no) end=_last; by Zipcode; if first.zipcode then seq_no=1; else seq_no+1; _rc=h_nodec.add(); end; /* create hash for bridge data */ dcl hash h_brdg(dataset:'bridge(obs=0)', multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* hash to store per zipcode the last seq_no (counter) used to populate the table with control data */ dcl hash h_last_ranno(); h_last_ranno.defineKey('zipcode'); h_last_ranno.defineData('seq_no'); h_last_ranno.defineDone(); /* hash to store for the current row of dec (file 1) all matching rows from the bridge table */ dcl hash h_zipcollect(); h_zipcollect.defineKey('iter'); h_zipcollect.defineData('zipcode'); h_zipcollect.defineDone(); end; call missing(of _all_); set dec(rename=(id=id_dec)); by date decile; if first.decile then do; /* load bridge data for current date into hash h_brdg */ _rc=h_brdg.delete(); h_brdg = _new_ hash(dataset:catx(' ','bridge(where=(date=',date,'and decile=',decile,'))'), multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* "copy" h_brdg to h_zipcollect that got a sequence key (iter) which allows to directly address items */ _rc=h_zipcollect.clear(); do iter=1 by 1 while(h_brdg.do_over() = 0); _rc=h_zipcollect.add(); end; end; /*** draw two controls per row in dec (file 1) ***/ select_cnt=0; /** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ do i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */ /* random selection of one of the zipcodes in hash h_zipcollect */ iter=rand('integer',1,h_zipcollect.num_items); if h_zipcollect.find()=0 then do; /** 2. draw control **/ /* for the current zipcode derive lowest value for seq_no for the row that hasn't been drawn already */ if h_last_ranno.find() ne 0 then do; seq_no=1; _rc=h_last_ranno.add(); end; /* draw control record */ if h_nodec.find()=0 then do; /* count how many control records selected for the current record from dec */ select_cnt=sum(select_cnt,1); output control; /* remove selected record from hash as we won't select it again */ _rc=h_nodec.remove(); /* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */ seq_no=sum(seq_no,1); _rc=h_last_ranno.replace(); end; else do; /* in case all rows for a zipcode have already been drawn don't use this zipcode any further */ _rc=h_zipcollect.remove(); end; end; end; if select_cnt<2 then output control_insufficient_data; run; /* title 'control'; */ /* proc print data=control; */ /* run; */ /* title 'Decedent with insufficient matching data to create control'; */ /* proc sql; */ /* select * */ /* from control_insufficient_data; */ /* quit; */ /* title; */

JKHess · Posted 02-03-2025 02:12 PM

Thank you @Patrick. Let me take a little time to look through these. I'm working in a virtual research environment, and my workspace limit is apparently 5 Tb. I only know this because the help desk alerted me when I exceeded this when running my cartesian product code..

Patrick · Posted 02-03-2025 06:45 PM

@JKHess wrote:
Thank you @Patrick. Let me take a little time to look through these. I'm working in a virtual research environment, and my workspace limit is apparently 5 Tb. I only know this because the help desk alerted me when I exceeded this when running my cartesian product code..

5TB is disk space where your SAS WORK resides and not memory which will be lower. Running the proc options as provided earlier will tell you how much memory you've got.

JKHess · Posted 02-04-2025 06:58 PM

@Patrick, I ran option 1 on the full files, after testing it on a smaller subset, and tracing the results through each of the three input files. It worked! And no issues with memory. Thank you so much...grateful for the help.

Patrick · Posted 02-04-2025 07:05 PM

@JKHess wrote:
@Patrick, I ran option 1 on the full files, after testing it on a smaller subset, and tracing the results through each of the three input files. It worked! And no issues with memory. Thank you so much...grateful for the help.

@JKHess That's great to hear. Out of curiosity: How much time did it take to run this process with your full data volume?

And just one more thought for you to consider:

I assume that it's statistically acceptable for your 2 year observation period that the records in File 3 are a snapshot without date range (=people moving zipcode not covered). @Tom mentioned this already.

But what about the deciles? If they are per zipcode and date then I believe your analysis will be skewed towards rural low population density zipcodes.

JKHess · Posted 02-04-2025 11:36 PM

HI @Patrick, the process took just under an hour to run, quicker than I expected.

Beneficiaries in this dataset are assigned a zipcode for each of the two calendar years. I believe it's the mailing address zipcode as of the end of the year. The vast majority had the same zipcode both years.

If that doesn't answer your question, please let me know. I may be missing something..

Patrick · Posted 02-04-2025 11:52 PM

@JKHess I believe the more important question that needs answering is:
"But what about the deciles? If they are per zipcode and date then I believe your analysis will be skewed towards rural low population density zipcodes."

The code you're currently using will randomly select a zipcode in scope to draw a control from. I would assume that the number of beneficiaries in your File3 data differs between zipcodes (I assume zipcodes from rural areas have tendentially less rows). Because the random selection of zipcodes is not weighted by number of rows under a zipcode, people living under a zipcode with less entries will have a higher probability to get selected as control (assumed rural has higher probability than urban).

If above is an issue then the code needs amendment.

JKHess · Posted 02-05-2025 07:54 PM

@Patrick, you are correct. I pulled in Rural-Urban Commuting Area (RUCA) codes for each zipcode, and did a PROC FREQ by case status. Cases are 76% metropolitan and 4.7% rural, controls are 63.5% metro and 13.2% rural..

JKHess · Posted 02-05-2025 08:49 PM

...I agree this is a problem. How can I weight by the number of records for each zipcode?

Patrick · Posted 02-05-2025 09:14 PM

@JKHess wrote:
...I agree this is a problem. How can I weight by the number of records for each zipcode?

I believe I have an approach how to implement this change with acceptable impact on performance. I'll give it a go once time.

Patrick · Posted 02-07-2025 08:03 PM

@JKHess wrote:
...I agree this is a problem. How can I weight by the number of records for each zipcode?

Below the amended code that now should choose id's from any of the zipcodes in scope with the same probability.

The hard bit was to come-up with a change for a weighted random selection of a zipcode where performance doesn't degrade too much. I couldn't avoid implementation of some additional looping. Let me know the runtime (and if it's still manageable).
And once again: Please review both the code and validate the result.

Spoiler

/*************** create sample data ************************/
/* file 1 */
data dec; 
  input ID Date :mmddyy10. Decile;
  format Date mmddyy10.;
datalines;
1 1/1/2017 1 
22 1/1/2017 1
41 1/1/2017 1
56 1/1/2017 2
79 1/1/2017 2
85 1/1/2017 2
100 1/2/2017 1
118 1/2/2017 1
125 1/2/2017 2
167 1/2/2017 2
178 1/2/2017 3
;
run;

/* file 2 - not really a bridge because relationship bridge:no_dec is many:many */
data bridge; 
  input Date :mmddyy10.  Decile Zipcode $5.;
  format Date mmddyy10.;
datalines;
1/1/2017 1 88123
1/1/2017 1 03867
1/1/2017 1 04001
1/1/2017 2 03304
1/1/2017 2 98765
1/1/2017 2 96224
1/1/2017 2 00001
1/2/2017 1 98801
1/2/2017 2 88123
1/2/2017 2 12345
1/2/2017 2 83356
1/2/2017 2 98765
1/2/2017 3 03304
1/2/2017 3 04945
;
run;

/* file 3 */
data no_dec; 
  input ID Zipcode $5.;
datalines;
2 88123
21 88123
22 88123
23 88123
24 88123
3 12345
4 03304
5 03867
6 04945
7 04001
8 98765
9 98801
10 96224
11 00001
12 83356
13 83356
;
run;

/************  data prep *************************************/
/* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */
data _null_;
  dcl hash h1(ordered:'y', multidata:'y');
  h1.defineKey('Zipcode','ran_no');
  h1.defineData('id','zipcode','ran_no');
  h1.defineDone();
  call streaminit(10);
  do until(_last);
    set no_dec end=_last;
    ran_no=rand('uniform');
    _rc=h1.add();
  end;
  _rc=h1.output(dataset:'no_dec_ranno');
  stop;
run;

/************  draw control *************************************/
data control(keep=id_dec id zipcode date Decile select_cnt)
     control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt)
     ;
  length id_dec 8;
  if _n_=1 then
    do;
      call streaminit(10);
      
      /* define hash to collect number of rows (items) per zipcode                       */
      /* - used for weighted random selection of zipcode from which to draw control from */
      n_items=0;
      dcl hash h_nodec_nperzip();
      h_nodec_nperzip.defineKey('zipcode');
      h_nodec_nperzip.defineData('n_items');
      h_nodec_nperzip.defineDone();
      /* load no_dec_ranno into hash replacing the random values by a sequence number (by zipcode)                                      */
      /* - the order is still random but the sequence number instead of a random value will allow to address specific items later on    */
      /* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB          */
      dcl hash h_nodec(ordered:'y');
      h_nodec.defineKey('Zipcode','seq_no');
      h_nodec.defineData('id','zipcode','seq_no');
      h_nodec.defineDone();
      do until(_last);
        set no_dec_ranno(drop=ran_no) end=_last;
        by Zipcode;
        /* populate hash h_nodec */
        if first.zipcode then seq_no=1;
        else seq_no+1;
        _rc=h_nodec.add();
        
        /* populate hash h_nodec_nperzip */
        n_items=sum(n_items,1);
        if last.zipcode then
          do;
            _rc=h_nodec_nperzip.add();
            n_items=0;
          end;
      end;
      /* load the bridge data into a hash */
      dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y');
      h_brdg.defineKey('date','decile');
      h_brdg.defineData('zipcode');
      h_brdg.defineDone();
      /* hash to store per zipcode the last sequence number used to populate the table with control data */
      /* - to ensure a record gets only drawn once                                                       */
      dcl hash h_last_ranno();
      h_last_ranno.defineKey('zipcode');
      h_last_ranno.defineData('seq_no');
      h_last_ranno.defineDone();
      
      /* arrays to store zipcode and cumulative sum of number of items under a zipcode */
      array a_zipcode{10000} $5  _temporary_;
      array a_itemcum{0:10000} 8 _temporary_;
      a_itemcum[0]=0;
    end;
  call missing(of _all_);
  
  set dec(rename=(id=id_dec));
  
  /*** draw two controls for case ***/
  /** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ 
  /* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */
  
  _rc=h_brdg.reset_dup();
  do _i=1 by 1 while(h_brdg.do_over() = 0);
    _rc=h_nodec_nperzip.find()=0;
    a_zipcode[_i]=zipcode;
    a_itemcum[_i]=sum(a_itemcum[_i-1],n_items);
  end;
  a_itemcum_n_elements=sum(_i,-1);
  
  select_cnt=0;
  do _i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */
  
    /** random selection of one of the matching zipcodes from array a_zipcode, weighted by number of items per zipcode **/
    /* create random integer in the range of 1 to n zipcodes to choose from */
    ran_val=rand('integer',1,a_itemcum[a_itemcum_n_elements]);
    
    /* binary search through array a_itemcum to find the element that stores the higher boundary          */ 
    /* - when found use the index of this element to derive the zipcode from which to draw control record */
    lbound=1;
    hbound=a_itemcum[a_itemcum_n_elements];
    do while(lbound <= hbound);
      midpt=floor(sum(lbound,hbound)/2);
      
      if a_itemcum[midpt-1] >= ran_val then hbound=midpt-1;
      else
      if a_itemcum[midpt]   <  ran_val then lbound=midpt+1;
      else
      if a_itemcum[midpt-1] <  ran_val <= a_itemcum[midpt] then
        do;
          zipcode=a_zipcode[midpt];
          leave;
        end;
    end;
    
    /** 2. draw control from population under selected zip code **/
    /* for the chosen zipcode derive the row with the lowest seq_no that hasn't been drawn previously */
    if h_last_ranno.find() ne 0 then
      do;
        seq_no=1;
        _rc=h_last_ranno.add();
      end;
    /* draw control record */ 
    if h_nodec.find()=0 then
      do;
        /* count how many control records selected for the current record from dec */
        select_cnt=sum(select_cnt,1);
        output control;
        /* remove selected record from hash as we won't select it again */
        _rc=h_nodec.remove();
        /* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */
        seq_no=sum(seq_no,1);
        _rc=h_last_ranno.replace();
      end;
  end;
  
  if select_cnt<2 then output control_insufficient_data;
run;

/* title 'control'; */
/* proc print data=control; */
/* run; */
/* title 'Decedent with insufficient matching data to create control'; */
/* proc sql; */
/*  select * */
/*  from control_insufficient_data; */
/* quit; */
/* title; */

/*************** create sample data ************************/ /* file 1 */ data dec; input ID Date :mmddyy10. Decile; format Date mmddyy10.; datalines; 1 1/1/2017 1 22 1/1/2017 1 41 1/1/2017 1 56 1/1/2017 2 79 1/1/2017 2 85 1/1/2017 2 100 1/2/2017 1 118 1/2/2017 1 125 1/2/2017 2 167 1/2/2017 2 178 1/2/2017 3 ; run; /* file 2 - not really a bridge because relationship bridge:no_dec is many:many */ data bridge; input Date :mmddyy10. Decile Zipcode $5.; format Date mmddyy10.; datalines; 1/1/2017 1 88123 1/1/2017 1 03867 1/1/2017 1 04001 1/1/2017 2 03304 1/1/2017 2 98765 1/1/2017 2 96224 1/1/2017 2 00001 1/2/2017 1 98801 1/2/2017 2 88123 1/2/2017 2 12345 1/2/2017 2 83356 1/2/2017 2 98765 1/2/2017 3 03304 1/2/2017 3 04945 ; run; /* file 3 */ data no_dec; input ID Zipcode $5.; datalines; 2 88123 21 88123 22 88123 23 88123 24 88123 3 12345 4 03304 5 03867 6 04945 7 04001 8 98765 9 98801 10 96224 11 00001 12 83356 13 83356 ; run; /************ data prep *************************************/ /* assign a random value to each entry in no_dec (file 3) and output as table no_dec_ranno sorted by zipcode and random value */ data _null_; dcl hash h1(ordered:'y', multidata:'y'); h1.defineKey('Zipcode','ran_no'); h1.defineData('id','zipcode','ran_no'); h1.defineDone(); call streaminit(10); do until(_last); set no_dec end=_last; ran_no=rand('uniform'); _rc=h1.add(); end; _rc=h1.output(dataset:'no_dec_ranno'); stop; run; /************ draw control *************************************/ data control(keep=id_dec id zipcode date Decile select_cnt) control_insufficient_data(keep=id_dec id zipcode date Decile select_cnt) ; length id_dec 8; if _n_=1 then do; call streaminit(10); /* define hash to collect number of rows (items) per zipcode */ /* - used for weighted random selection of zipcode from which to draw control from */ n_items=0; dcl hash h_nodec_nperzip(); h_nodec_nperzip.defineKey('zipcode'); h_nodec_nperzip.defineData('n_items'); h_nodec_nperzip.defineDone(); /* load no_dec_ranno into hash replacing the random values by a sequence number (by zipcode) */ /* - the order is still random but the sequence number instead of a random value will allow to address specific items later on */ /* - memory consumption of this hash is around 88bytes * number of items plus some overhead. For 34.6M rows close to 3GB */ dcl hash h_nodec(ordered:'y'); h_nodec.defineKey('Zipcode','seq_no'); h_nodec.defineData('id','zipcode','seq_no'); h_nodec.defineDone(); do until(_last); set no_dec_ranno(drop=ran_no) end=_last; by Zipcode; /* populate hash h_nodec */ if first.zipcode then seq_no=1; else seq_no+1; _rc=h_nodec.add(); /* populate hash h_nodec_nperzip */ n_items=sum(n_items,1); if last.zipcode then do; _rc=h_nodec_nperzip.add(); n_items=0; end; end; /* load the bridge data into a hash */ dcl hash h_brdg(dataset:'bridge', multidata:'y', ordered:'y'); h_brdg.defineKey('date','decile'); h_brdg.defineData('zipcode'); h_brdg.defineDone(); /* hash to store per zipcode the last sequence number used to populate the table with control data */ /* - to ensure a record gets only drawn once */ dcl hash h_last_ranno(); h_last_ranno.defineKey('zipcode'); h_last_ranno.defineData('seq_no'); h_last_ranno.defineDone(); /* arrays to store zipcode and cumulative sum of number of items under a zipcode */ array a_zipcode{10000} $5 _temporary_; array a_itemcum{0:10000} 8 _temporary_; a_itemcum[0]=0; end; call missing(of _all_); set dec(rename=(id=id_dec)); /*** draw two controls for case ***/ /** 1. select zipcode from bridge for lookup of rows in no_dec (file 3) */ /* load all zipcodes from the bridge into hash h_zipcollect that match with the current row from dec (file 1) */ _rc=h_brdg.reset_dup(); do _i=1 by 1 while(h_brdg.do_over() = 0); _rc=h_nodec_nperzip.find()=0; a_zipcode[_i]=zipcode; a_itemcum[_i]=sum(a_itemcum[_i-1],n_items); end; a_itemcum_n_elements=sum(_i,-1); select_cnt=0; do _i=1 to 99 until(select_cnt=2); /* if sufficient data to draw control from, loop will only iterate twice */ /** random selection of one of the matching zipcodes from array a_zipcode, weighted by number of items per zipcode **/ /* create random integer in the range of 1 to n zipcodes to choose from */ ran_val=rand('integer',1,a_itemcum[a_itemcum_n_elements]); /* binary search through array a_itemcum to find the element that stores the higher boundary */ /* - when found use the index of this element to derive the zipcode from which to draw control record */ lbound=1; hbound=a_itemcum[a_itemcum_n_elements]; do while(lbound <= hbound); midpt=floor(sum(lbound,hbound)/2); if a_itemcum[midpt-1] >= ran_val then hbound=midpt-1; else if a_itemcum[midpt] < ran_val then lbound=midpt+1; else if a_itemcum[midpt-1] < ran_val <= a_itemcum[midpt] then do; zipcode=a_zipcode[midpt]; leave; end; end; /** 2. draw control from population under selected zip code **/ /* for the chosen zipcode derive the row with the lowest seq_no that hasn't been drawn previously */ if h_last_ranno.find() ne 0 then do; seq_no=1; _rc=h_last_ranno.add(); end; /* draw control record */ if h_nodec.find()=0 then do; /* count how many control records selected for the current record from dec */ select_cnt=sum(select_cnt,1); output control; /* remove selected record from hash as we won't select it again */ _rc=h_nodec.remove(); /* increase seq_no by 1 for this zipcode as prep of selection of another row for the table with controls */ seq_no=sum(seq_no,1); _rc=h_last_ranno.replace(); end; end; if select_cnt<2 then output control_insufficient_data; run; /* title 'control'; */ /* proc print data=control; */ /* run; */ /* title 'Decedent with insufficient matching data to create control'; */ /* proc sql; */ /* select * */ /* from control_insufficient_data; */ /* quit; */ /* title; */

I do have another design idea how to approach your problem which would likely perform quite a bit better but for which I would need to know more about your data and environment (to not waste memory) to decide if it's feasible. Let's see how above code performs before we go there.

About the address changes:

The dataset had zipcode assigned to each beneficiary annually. About 5% changed zipcode from one calendar year to the next

If you've got annual snapshots then why not run the process for each year separately using the matching yearly snapshot. That shouldn't take that much but return better data which must be in your interest.

JKHess · Posted 02-07-2025 09:06 PM

@Patrick, sorry, did you edit this code? I sent you about an error message I received when I ran this..

How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Re: How do I randomly select controls from within a large administrative dataset using Proc SQL

Registration is open

SAS Training: Just a Click Away