I have a large health plan dataset (n~36 million records) that I need to randomly select two non-decedent controls for each decedent case. The matching variable is decile of air pollutant exposure, which was calculated from the full distribution of daily pollutant concentrations for all zip codes occurring in the dataset, across the 2-year study period. I've tried to approach this with 3 files: File 1 - Decedent cases (n=2.8 million): ID, Date (DeathDt), Decile (on DeathDt) File 2 - Bridge file: Date, Zipcode, Decile [file contains all possible combinations of date and zipcode in the non-decedent file, with decile assigned based on pollutant concentration on that date for that zipcode] File 3 - Non-decedent controls (n=34.6 million): ID, Zipcode The files look like this: File 1: ID Date Decile 1 1/1/2017 1 22 1/1/2017 1 41 1/1/2017 1 56 1/1/2017 2 79 1/1/2017 2 85 1/1/2017 2 100 1/2/2017 1 118 1/2/2017 1 125 1/2/2017 2 167 1/2/2017 2 178 1/2/2017 3 File 2: Date Zipcode Decile 1/1/2017 12832 1 1/1/2017 03349 1 1/1/2017 04001 2 1/2/2017 56723 2 1/2/2017 88123 1 1/3/2017 80010 3 1/3/2017 96224 3 File 3: ID Zipcode 2 88123 3 12345 4 03304 5 03867 6 04945 7 04001 8 98765 9 98801 10 96224 11 00001 12 83356 Again, for each decedent case, I need to randomly select 2 non-decedent controls who were alive as of the case's death date, but this can't be done directly because there is no date in the non-decedent file. I created a cartesian product joining Files 1 and 2, which produced a file named 'combined', and gave me the full range of zipcodes that are in the same decile as the case, on the case's date of death (code below). My plan was to join this file with the non-decedents file by zipcode, and then select 2 controls randomly. But at this point, the files were huge, and I ran out of space...and I don't know really how to do the random selection from here. I'm thinking there has to be a more efficient way to do this. Sorry for the complicated description. Would appreciate any advice for how to do this.. proc sql; create table combined as select f1.decile as CaseDecile, f1.date as CaseDate f2.decile as BridgeDecile, f2.date as BridgeDate, f2.zipcode as BridgeZipcode from decedents as f1, bridge as f2 where f1.decile = f2.decile AND f1.date = f2.date; quit;
... View more