About LeStatisticien

LeStatisticien · ‎01-30-2020

This post is a sort of tutorial for those doing selective simulations inside population micro data files, where multiple attributes of a respondent determine the probability distribution used to control a simulation for that person. The underlying strategy is as ‘old as the hills’; but I have not seen an illustration that involves SAS Hash (look-up) tables. My file involves simulations for three variables, and they are stacked sets of operations. For each set, I used to move the population file outside SAS, do the work and then bring it back. Now, thanks to FreelanceReinhard and the discussion at https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf, all three steps are done inside SAS, saving a ton of time. So this is huge! The code below deals with the last set of operations only. The experts here will be annoyed at the needless comments; but this post is for people like me who have used SAS for decades and did not know that doing this sort of simulation work was feasible without leaving SAS. So thanks again FreelanceReinhard. /* =========== START IMPUTE ADL_Categ HERE */ DATA temp3; SET sasfiles.simulmar80_ont ; RUN; ... DATA ADLCategDonorDistribs; INFILE '/folders/myfolders/ADL_Categ DonorDist.csv' DELIMITER=',' DSD ; INPUT sex ageg pCat0 pCat1 pCat2 ; /* signature == sex ageg; proportions== pCat0 pCat1 pCat2 */ RUN; /* A donor file line looks like this: 2,18,0.635780988,0.332386646,0.031832366 . Both the name and the coding for "sex" and "ageg" must be IDENTICAL between this file and the population file */ /* CAUTION ===== run this segment only when you want to reset the seed. */ DATA _NULL_; CALL STREAMINIT(0); /* generate seed from system clock */ X = RAND("UNIFORM"); RUN; %PUT &=SYSRANDOM; /* See: https://blogs.sas.com/content/iml/2017/06/01/choose-seed-random-number.html */ /* ===== IMPORTANT: this is where we set the number of imputations that will be generated (for a given variable) for each respondent. We can use a number greater than 1 at n=1 below to bootstrap a confidence interval for the imputed value. */ %let n=1; /* sample size per simulation (one record one variable) declared as a macro variable */ QUIT; /* The simulation is done in this step */ DATA SimulOutput ; CALL STREAMINIT(313777059); /* Here we set the seed for the random number drawing, */ /* Now we load into RAM a collection of donor distributions, one for each unique respondent signature. We load it into a "hash" (or look-up) table. _N_=1 means the load is executed as soon as the first observation is brought into RAM */ IF _N_=1 THEN DO; DECLARE HASH obj(DATASET:'ADLCategDonorDistribs'); obj.DEFINEKEY('sex','ageg'); obj.DEFINEDATA('pCat0','pCat1','pCat2'); obj.DEFINEDONE(); IF 0 THEN SET ADLCategDonorDistribs ; /* FreelanceReinhard says to leave this alone. The SAS coders have not revealed why this is needed here */ END; /* At this point the entire collection of donor distrubutions is in RAM. SAS should prominently offer me a chance to confirm that they have been read (loaded) correctly. FreelanceReinhard has sent me the code I can used to retrieve what was read. */ SET temp3; /* This is the pop (observations) file. */ ADL_Categ2 = ADL_Categ; /* We are imputing selected values for ADL_Categ; but we leave the original values unchanged */ /* Retrieve the correct donor distribution for this particular observation and draw random sample(s). It is a sample of 1 when &n = 1 . obj.FIND() searches for the match and when it is found RAND(...) randomly choses 1 (Cat0), 2 (Cat1) or 3 (Cat2) under control of the probabilities 'pCat0','pCat1','pCat2' . */ IF Age80Plus EQ 1 THEN DO; IF obj.FIND()=0 THEN DO sampno=1 to &n; y=RAND('TABLE', OF pCat0 pCat1 pCat2); IF y EQ 1 THEN ADL_Categ2 = 0; IF y EQ 2 THEN ADL_Categ2 = 1 ; IF y EQ 3 THEN ADL_Categ2 = 2; OUTPUT; END; END; RUN; /* obj.FIND() is a function (method??) that causes a search among lines of the donor distrution file to find the one whose signature matches the key declared above. If an eligible line on the PopFile has no matching key on the DistribDonor file, SAS jumps to the next PopFile line. You can use user-defined tables later on to see which eligible lines failed to get an imputation of ADL_Categ2. */ PROC FREQ ; TABLES sampno sex ageg y ADL_Categ2 ; RUN; /* Do these distributions look decent? Check this before going further. It may be good to add ageg*ADL_Categ and ageg*ADL_Categ2 for more detailed checking . */ DATA temp4; SET temp3; IF Age80Plus EQ 0 ; ADL_Categ2 = ADL_Categ; RUN; /* This operation cannot be done where the look-up table is being used to find key matches. REM. -- records with non-matches are simply ignored */ DATA temp5; SET temp4 SimulOutput ; RUN; /* Here we concatenate the two datasets. Rem. that in SimulOutput all persons are aged 80+, and in temp4 there are no such persons. */ PROC FREQ DATA=temp5; TABLES sex ageg mar ADL_Categ2 ageg*ADL_Categ ageg*ADL_Categ2; RUN; /* Check that all these look decent. */ DATA sasfiles.Sim_ADLCateg_80_ONT ; SET temp5; RUN; /* This dataset has all the simulations on board */ /* IMP: Breakdown of 80-plus to show properly simulated details for 80-84, 85-89 and 90+ are in this file for ageg, mar and ADL_Categ2 ONLY. For all other tabulations, at ages 80-plus show ONE line for 80-plus, which will be the survey data (not simulation output). */

LeStatisticien · ‎01-28-2020

Thanks for excellent help, FreelanceReinhard. I will study your points carefully, and benefit from them as I move the code to the 'next level' for my real-world applications. The two references you passed along are gratefully received, and will be used repeatedly in the months ahead. I welcome this new (to me - old to you) and massive improvement in the SAS functionality.

LeStatisticien · ‎01-27-2020

Apologies for all the "typos"; especially "exhaust" which should be SAS.

LeStatisticien · ‎01-27-2020

Hello FreelanceReinhard! Your code (adapted in the direction of my eventual application) ran the first time with no error message. So thanks again, and the code is below. I have a few education questions following the code. /* Create test data */ DATA observations; INPUT id x1 x2 x3; CARDS; 1 3 1 4 2 5 9 2 3 8 0 3 ; RUN; /* donor distributions for the simulation */ DATA AgeDonorDistribs; INPUT sig p1 p2 p3 ; CARDS; 144 .1 .2 .5 15000 .3 .4 .2 32000 .4 .2 .3 43740 .6 .1 .1 48600 .2 .1 .3 15746400 .2 .3 .2 ; /* sig, the ID of an imputation donor distribution, can also be a vector of categorical variables that represent respondent attributes - see https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf and https://www.linkedin.com/pulse/sas-hash-tables-patrick-cuba N.B. In a real donor distribution we require p1+p2+p3 = 1 */ %let n=1; /* sample size per ID */ /* n=1 is normal in population simulation, except where we are setting up sesitivity tests */ DATA SimulOutput(KEEP=id sampno y); CALL STREAMINIT(27182818); IF _N_=1 THEN DO; DECLARE HASH obj(DATASET:'AgeDonorDistribs'); obj.DEFINEKEY('sig'); obj.DEFINEDATA('p1','p2','p3'); obj.DEFINEDONE(); IF 0 THEN SET AgeDonorDistribs ; /* What on earth does this do??? */ END; SET observations; /* compute the signature of the observation now in RAM */ sig=2**x1*3**x2*5**x3; /* retrieve the correct donor distribution for this particular observation and draw random sample(s). It is a sample of 1 when &n = 1 */ IF obj.FIND()=0 THEN DO sampno=1 to &n; y=RAND('TABLE', OF p1 p2 p3); OUTPUT; END; RUN; Questions: (1) Once the collection of simulation-donor distributions is loaded into RAM from an external file, is there some way for me to instructs SAS to bring the table up to the screen so that I can confirm that the loading has happened correctly? (2) I expect exhaust to issue a syntax error message AFTER DO here: " IF obj.FIND()=0 THEN DO sampno=1 to &n;". Why did not do so? (3) Help me here please: what exactly is this expression doing ? IF obj.FIND()=0 . On the surface, it looks as if an actions to be taken up a condition does not exist (see =0); what I see in the output WORK file that appropriate action seems to have been taken. (4) What error message will I get when the look-up table signatures contain no match for the computed signature of the just loaded record? (My first question is important here because the failure to match could be caused by a reading error as the look-up table was being brought into RAM). (5) In "IF 0 THEN SET AgeDonorDistribs ;", what is being evaluated at zero, and why do we point to the donor distributions when the DECLARE JHASH ... has already done that? Thanks in advance. Your answers will help me to intelligently make big changes in the details for my eventual applications.

LeStatisticien · ‎01-26-2020

My goodness Reeza you have sent me to a gold mine! Just starting to read the piece I see the exact philosophy that is driving my work, and my situation is a bit easier in that most of my micro data file comprises values drawn from survey observations. I am going to enjoy learning the procedures covered in that paper. So thanks again!

LeStatisticien · ‎01-26-2020

Thanks for the link Reeza, and yes I am dealing with public use micro data files. My output involves 'patching' information into File A based on patterns in File B that involve networks of predictor variables that both files share (using hopefully reasonable assumptions and supported where feasible with relevant multivariate analysis), so there is quite a bit of 'synthetic stuff' in my records. I expect i will learn some useful ideas from the page to which you are sending me.

LeStatisticien · ‎01-26-2020

Wonderful FreelanceReinhard! I have used Option1 in the R language and am avoiding it in SAS because the situation does not allow a formula to be built; but even before trying out your Option 2 I am fascinated with the idea. So I am going to "dig in here" and if I get stuck some place I will ask for more help. In any event, at the end to the road, doing all this inside SAS is a huge development since in my simulation work I often am at a point where the micro data file has several variables with simulated values here and there and it is embedded in a SAS dataset. Cheers!

LeStatisticien · ‎01-25-2020

Help please. I wish to do a simulation using an arbitrary distribution over the possible values of a categorical variable. The TABLE facility along with RAND seem just what I need; but I have a collection of such distributions. The program must decide which is the correct distribution in this collection to use, based upon the respondent's attributes on a few variables. I have dug hard in our SAS help literature but I have not found an example where the p(*) can change from one respondent to the next. I would be overjoyed to learn what steps would achieve my goal. Thanks in advance. PS. The way that I have done this outside of SAS is to attach a signature (a list of indicator values) to each distribution (the collection resides in an external data file). My program then computes the signature from a respondent's record, and when this signature matches that of one of the collection of distributions the program will use that distribution to proceed with the simulation. Now I want to stop doing this work outside of SAS.

Online Status	Offline
Date Last Visited	‎02-06-2020 01:07 PM

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Drawing a random sample from arbitrary distributions when the probabil...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Re: Drawing a random sample from arbitrary distributions when the prob...

Drawing a random sample from arbitrary distributions when the probabil...