This post is a tutorial of sorts for those doing selective simulations inside population microdata files, where several attributes of a respondent determine the probability distribution that controls the simulation for that person. The underlying strategy is as ‘old as the hills’, but I have not seen an illustration that uses SAS hash (look-up) tables.

My file involves simulations for three variables, done as stacked sets of operations. For each set, I used to move the population file outside SAS, do the work, and then bring it back. Now, thanks to FreelanceReinhard and the discussion at https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf, all three steps are done inside SAS, which saves a great deal of time. So this is huge!

The code below deals with the last set of operations only. The experts here will be annoyed at the needless comments, but this post is for people like me who have used SAS for decades and did not know that this sort of simulation work was feasible without leaving SAS. So thanks again, FreelanceReinhard.

/* =========== START IMPUTE ADL_Categ HERE */

DATA temp3;
  SET sasfiles.simulmar80_ont;
RUN;

...

DATA ADLCategDonorDistribs;
  INFILE '/folders/myfolders/ADL_Categ DonorDist.csv' DELIMITER=',' DSD;
  INPUT sex ageg pCat0 pCat1 pCat2;   /* signature == sex ageg; proportions == pCat0 pCat1 pCat2 */
RUN;
/* A donor file line looks like this: 2,18,0.635780988,0.332386646,0.031832366
   Both the names and the coding of "sex" and "ageg" must be IDENTICAL between this file and the population file. */

/* CAUTION ===== run this segment only when you want to reset the seed. */
DATA _NULL_;
  CALL STREAMINIT(0);    /* generate seed from system clock */
  X = RAND("UNIFORM");
RUN;
%PUT &=SYSRANDOM;
/* See: https://blogs.sas.com/content/iml/2017/06/01/choose-seed-random-number.html */

/* ===== IMPORTANT: this is where we set the number of imputations that will be generated (for a given variable) for each respondent.
   Setting n to a number greater than 1 below lets us bootstrap a confidence interval for the imputed value. */
%let n=1;   /* sample size per simulation (one record, one variable), declared as a macro variable */
QUIT;

/* The simulation is done in this step */
DATA SimulOutput;
  CALL STREAMINIT(313777059);   /* Here we set the seed for the random number drawing */

  /* Now we load into RAM a collection of donor distributions, one for each unique respondent signature.
     We load it into a "hash" (or look-up) table. _N_=1 means the load is executed once only,
     on the first pass through the DATA step. */
  IF _N_=1 THEN DO;
    DECLARE HASH obj(DATASET:'ADLCategDonorDistribs');
    obj.DEFINEKEY('sex','ageg');
    obj.DEFINEDATA('pCat0','pCat1','pCat2');
    obj.DEFINEDONE();
    IF 0 THEN SET ADLCategDonorDistribs;
    /* FreelanceReinhard says to leave this alone. As far as I can tell, the statement never executes;
       it only lets the compiler define pCat0, pCat1 and pCat2 as DATA step variables, so that the
       hash object has somewhere to put the values it retrieves. */
  END;
  /* At this point the entire collection of donor distributions is in RAM. SAS should prominently offer me
     a chance to confirm that they have been read (loaded) correctly. FreelanceReinhard has sent me the
     code I can use to retrieve what was read. */

  SET temp3;                  /* This is the pop (observations) file. */
  ADL_Categ2 = ADL_Categ;     /* We are imputing selected values for ADL_Categ, but we leave the original values unchanged */

  /* Retrieve the correct donor distribution for this particular observation and draw random sample(s).
     It is a sample of 1 when &n = 1. obj.FIND() searches for the match, and when it is found RAND(...)
     randomly chooses 1 (Cat0), 2 (Cat1) or 3 (Cat2) under control of the probabilities pCat0, pCat1, pCat2. */
  IF Age80Plus EQ 1 THEN DO;
    IF obj.FIND()=0 THEN DO sampno=1 TO &n;
      y = RAND('TABLE', OF pCat0 pCat1 pCat2);
      IF y EQ 1 THEN ADL_Categ2 = 0;
      IF y EQ 2 THEN ADL_Categ2 = 1;
      IF y EQ 3 THEN ADL_Categ2 = 2;
      OUTPUT;
    END;
  END;
RUN;
/* obj.FIND() is a method that searches the donor distributions held in the hash table for the one whose
   signature matches the key declared above. If an eligible line on the PopFile has no matching key on the
   DistribDonor file, nothing is written for it and SAS moves on to the next PopFile line. You can use
   user-defined tables later on to see which eligible lines failed to get an imputation of ADL_Categ2. */

PROC FREQ DATA=SimulOutput;
  TABLES sampno sex ageg y ADL_Categ2;
RUN;
/* Do these distributions look decent? Check this before going further. It may be good to add
   ageg*ADL_Categ and ageg*ADL_Categ2 for more detailed checking. */

DATA temp4;
  SET temp3;
  IF Age80Plus EQ 0;
  ADL_Categ2 = ADL_Categ;
RUN;
/* This operation cannot be done in the step where the look-up table is being used to find key matches.
   Remember: records with non-matches are simply ignored there. */

DATA temp5;
  SET temp4 SimulOutput;
RUN;
/* Here we concatenate the two datasets. Remember that in SimulOutput all persons are aged 80+,
   and in temp4 there are no such persons. */

PROC FREQ DATA=temp5;
  TABLES sex ageg mar ADL_Categ2 ageg*ADL_Categ ageg*ADL_Categ2;
RUN;
/* Check that all these look decent. */

DATA sasfiles.Sim_ADLCateg_80_ONT;
  SET temp5;
RUN;
/* This dataset has all the simulations on board */

/* IMPORTANT: The breakdown of 80-plus into properly simulated details for 80-84, 85-89 and 90+ is in this
   file for ageg, mar and ADL_Categ2 ONLY. For all other tabulations, show ONE line for ages 80-plus,
   which will be the survey data (not simulation output). */
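
The comments in the code mention two checks without showing them: confirming what the hash table actually loaded, and listing the eligible records that found no matching donor distribution. What follows is only a rough sketch of how both might be done, not FreelanceReinhard's code. It assumes the datasets ADLCategDonorDistribs and temp3 from the post already exist; the output dataset names HashCheck and Unmatched80 and the variable nloaded are made up for illustration.

```sas
/* (a) Write the contents of the hash table back out to a dataset.           */
/*     HashCheck is a hypothetical name. The OUTPUT method writes data       */
/*     items only, so the keys are also listed in DEFINEDATA here.           */
DATA _NULL_;
  IF 0 THEN SET ADLCategDonorDistribs;   /* never executes; defines the variables at compile time */
  DECLARE HASH obj(DATASET:'ADLCategDonorDistribs', ORDERED:'A');
  obj.DEFINEKEY('sex','ageg');
  obj.DEFINEDATA('sex','ageg','pCat0','pCat1','pCat2');
  obj.DEFINEDONE();
  nloaded = obj.NUM_ITEMS;               /* number of signatures held in the table */
  PUT 'Donor distributions loaded: ' nloaded;
  rc = obj.OUTPUT(DATASET:'HashCheck');
  STOP;                                  /* one pass is enough; no observations are read */
RUN;

PROC PRINT DATA=HashCheck;
RUN;

/* (b) List the eligible (80-plus) population records whose signature has    */
/*     no donor distribution. Unmatched80 is a hypothetical name.            */
DATA Unmatched80;
  IF _N_=1 THEN DO;
    IF 0 THEN SET ADLCategDonorDistribs;
    DECLARE HASH obj(DATASET:'ADLCategDonorDistribs');
    obj.DEFINEKEY('sex','ageg');
    obj.DEFINEDATA('pCat0','pCat1','pCat2');
    obj.DEFINEDONE();
  END;
  SET temp3;
  /* CHECK() returns 0 when the key is found; it does not retrieve any data. */
  IF Age80Plus EQ 1 AND obj.CHECK() NE 0;
  KEEP sex ageg;
RUN;

PROC FREQ DATA=Unmatched80;
  TABLES sex*ageg / LIST;
RUN;
```

If Unmatched80 comes out empty, every eligible record had a donor distribution. Otherwise the PROC FREQ listing shows which sex and ageg combinations are missing from the donor file, and those are the records that would silently drop out of SimulOutput.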