Help please. I wish to do a simulation using an arbitrary distribution over the possible values of a categorical variable. The TABLE facility along with RAND seem just what I need; but I have a collection of such distributions. The program must decide which is the correct distribution in this collection to use, based upon the respondent's attributes on a few variables.
I have dug hard in our SAS help literature but I have not found an example where the p(*) can change from one respondent to the next. I would be overjoyed to learn what steps would achieve my goal.
Thanks in advance.
PS. The way that I have done this outside of SAS is to attach a signature (a list of indicator values) to each distribution (the collection resides in an external data file). My program then computes the signature from a respondent's record, and when this signature matches that of one of the collection of distributions the program will use that distribution to proceed with the simulation. Now I want to stop doing this work outside of SAS.
Hello @LeStatisticien,
So, you have a dataset with respondent IDs and several variables, say x1, x2, x3 for example, which determine the distribution to draw a random sample from. The distribution is one of a categorical variable with, say 4 values (1, 2, 3, 4) and hence determined by the probabilities p1, p2, p3 for the first three values (then p4=1-p1-p2-p3).
/* Create test data */
data have;
input id x1-x3;
cards;
1 3 1 4
2 5 9 2
3 8 0 3
;
Option 1: Compute the probabilities from x1, x2, x3 if you know a formula. (I use arbitrary formulas below, just for demonstration.)
%let n=100; /* sample size per ID */
data want;
call streaminit(27182818);
set have;
/* compute (arbitrary) probabilities */
p1=1/(3+min(of x:));
p2=1/(3+max(of x:));
p3=1/(3+mean(of x:));
/* draw random samples of size n */
do sampno=1 to &n;
y=rand('table', of p1-p3);
output;
end;
run;
Option 2: Predetermine the probabilities for each "signature" and store them in a SAS dataset. Retrieve them by means of a hash object. (I compute an arbitrary signature below.)
data distr;
input sig p1 p2 p3;
cards;
144 .1 .2 .5
15000 .3 .4 .2
32000 .4 .2 .3
43740 .6 .1 .1
48600 .2 .1 .3
15746400 .2 .3 .2
;
%let n=100; /* sample size per ID */
data want(keep=id sampno y);
call streaminit(27182818);
if _n_=1 then do;
/* load distribution for each signature into hash table */
dcl hash h(dataset:'distr');
h.definekey('sig');
h.definedata('p1','p2','p3');
h.definedone();
if 0 then set distr;
end;
set have;
/* compute signature */
sig=2**x1*3**x2*5**x3;
/* retrieve probabilities and draw random samples */
if h.find()=0 then do sampno=1 to &n;
y=rand('table', of p1-p3);
output;
end;
run;
Dataset distr does not need to be sorted.
If you are generating synthetic data this may be helpful.
https://support.sas.com/resources/papers/proceedings17/1224-2017.pdf
Hope it helps.
Wonderful FreelanceReinhard! I have used Option1 in the R language and am avoiding it in SAS because the situation does not allow a formula to be built; but even before trying out your Option 2 I am fascinated with the idea. So I am going to "dig in here" and if I get stuck some place I will ask for more help. In any event, at the end to the road, doing all this inside SAS is a huge development since in my simulation work I often am at a point where the micro data file has several variables with simulated values here and there and it is embedded in a SAS dataset.
Cheers!
Hello FreelanceReinhard! Your code (adapted in the direction of my eventual application) ran the first time with no error message. So thanks again, and the code is below. I have a few education questions following the code.
/* Create test data */
DATA observations;
INPUT id x1 x2 x3;
CARDS;
1 3 1 4
2 5 9 2
3 8 0 3
;
RUN;
/* donor distributions for the simulation */
DATA AgeDonorDistribs;
INPUT sig p1 p2 p3 ;
CARDS;
144 .1 .2 .5
15000 .3 .4 .2
32000 .4 .2 .3
43740 .6 .1 .1
48600 .2 .1 .3
15746400 .2 .3 .2
;
/* sig, the ID of an imputation donor distribution,
can also be a vector of categorical variables
that represent respondent attributes -
see https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf
and
https://www.linkedin.com/pulse/sas-hash-tables-patrick-cuba
N.B. In a real donor distribution we require p1+p2+p3 = 1 */
%let n=1; /* sample size per ID */
/* n=1 is normal in population simulation, except
where we are setting up sesitivity tests */
DATA SimulOutput(KEEP=id sampno y);
CALL STREAMINIT(27182818);
IF _N_=1 THEN DO;
DECLARE HASH obj(DATASET:'AgeDonorDistribs');
obj.DEFINEKEY('sig');
obj.DEFINEDATA('p1','p2','p3');
obj.DEFINEDONE();
IF 0 THEN SET AgeDonorDistribs ;
/* What on earth does this do??? */
END;
SET observations;
/* compute the signature of the observation now in RAM */
sig=2**x1*3**x2*5**x3;
/* retrieve the correct donor distribution for
this particular observation and draw random sample(s).
It is a sample of 1 when &n = 1 */
IF obj.FIND()=0 THEN
DO sampno=1 to &n;
y=RAND('TABLE', OF p1 p2 p3);
OUTPUT;
END;
RUN;
Questions:
(1) Once the collection of simulation-donor distributions is loaded into RAM from an external file, is there some way for me to instructs SAS to bring the table up to the screen so that I can confirm that the loading has happened correctly?
(2) I expect exhaust to issue a syntax error message AFTER DO here: " IF obj.FIND()=0 THEN DO sampno=1 to &n;". Why did not do so?
(3) Help me here please: what exactly is this expression doing ? IF obj.FIND()=0 . On the surface, it looks as if an actions to be taken up a condition does not exist (see =0); what I see in the output WORK file that appropriate action seems to have been taken.
(4) What error message will I get when the look-up table signatures contain no match for the computed signature of the just loaded record? (My first question is important here because the failure to match could be caused by a reading error as the look-up table was being brought into RAM).
(5) In "IF 0 THEN SET AgeDonorDistribs ;", what is being evaluated at zero, and why do we point to the donor distributions when the DECLARE JHASH ... has already done that?
Thanks in advance. Your answers will help me to intelligently make big changes in the details for my eventual applications.
Glad to read that my code worked for you. Please find below the answers to your questions.
IF _N_=1 THEN DO; DECLARE HASH obj(DATASET:'AgeDonorDistribs', ordered:'a'); obj.DEFINEKEY('sig'); obj.DEFINEDATA('sig','p1','p2','p3'); obj.DEFINEDONE(); IF 0 THEN SET AgeDonorDistribs ; obj.output(dataset:'check_ht'); END;Alternatively (and requiring more additional code), you could retrieve all values from the hash object (key value by key value) and write them to the log or output window. I don't think it's worth the effort, but if you ever observe a significant discrepancy between check_ht and AgeDonorDistribs, please create a new thread about your finding in this forum.
@LeStatisticien wrote:
Your answers will help me to intelligently make big changes in the details for my eventual applications.
Becoming familiar with the SAS hash object is a much greater endeavor than, say, learning a new function. But it pays off (at least it did for me). Two decent books have been written about this topic (SAS® Hash Object Programming Made Easy and Data Management Solutions Using SAS® Hash Table...). Good luck with the "big changes" you're planning.
Thanks for excellent help, FreelanceReinhard. I will study your points carefully, and benefit from them as I move the code to the 'next level' for my real-world applications. The two references you passed along are gratefully received, and will be used repeatedly in the months ahead. I welcome this new (to me - old to you) and massive improvement in the SAS functionality.
This post is a sort of tutorial for those doing selective simulations inside population micro data files, where multiple attributes of a respondent determine the probability distribution used to control a simulation for that person. The underlying strategy is as ‘old as the hills’; but I have not seen an illustration that involves SAS Hash (look-up) tables.
My file involves simulations for three variables, and they are stacked sets of operations. For each set, I used to move the population file outside SAS, do the work and then bring it back. Now, thanks to FreelanceReinhard and the discussion at https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf, all three steps are done inside SAS, saving a ton of time. So this is huge!
The code below deals with the last set of operations only. The experts here will be annoyed at the needless comments; but this post is for people like me who have used SAS for decades and did not know that doing this sort of simulation work was feasible without leaving SAS. So thanks again FreelanceReinhard.
/* =========== START IMPUTE ADL_Categ HERE */
DATA temp3;
SET sasfiles.simulmar80_ont ;
RUN;
...
DATA ADLCategDonorDistribs;
INFILE '/folders/myfolders/ADL_Categ DonorDist.csv' DELIMITER=',' DSD ;
INPUT sex ageg pCat0 pCat1 pCat2 ;
/* signature == sex ageg; proportions== pCat0 pCat1 pCat2 */
RUN;
/* A donor file line looks like this: 2,18,0.635780988,0.332386646,0.031832366 .
Both the name and the coding for "sex" and "ageg" must be IDENTICAL
between this file and the population file */
/* CAUTION ===== run this segment only when you want to reset the seed. */
DATA _NULL_;
CALL STREAMINIT(0); /* generate seed from system clock */
X = RAND("UNIFORM");
RUN;
%PUT &=SYSRANDOM;
/*
See: https://blogs.sas.com/content/iml/2017/06/01/choose-seed-random-number.html
*/
/* ===== IMPORTANT: this is where we set the number of
imputations that will be generated (for a given variable)
for each respondent. We can use a number greater than 1 at n=1 below
to bootstrap a confidence interval for the imputed value. */
%let n=1; /* sample size per simulation (one record one variable) declared as a macro variable */
QUIT;
/* The simulation is done in this step */
DATA SimulOutput ;
CALL STREAMINIT(313777059);
/* Here we set the seed for the random number drawing, */
/* Now we load into RAM a collection of donor distributions, one for
each unique respondent signature. We load it into a "hash" (or look-up) table.
_N_=1 means the load is executed as soon as the
first observation is brought into RAM */
IF _N_=1 THEN DO;
DECLARE HASH obj(DATASET:'ADLCategDonorDistribs');
obj.DEFINEKEY('sex','ageg');
obj.DEFINEDATA('pCat0','pCat1','pCat2');
obj.DEFINEDONE();
IF 0 THEN SET ADLCategDonorDistribs ;
/* FreelanceReinhard says to leave this alone.
The SAS coders have not revealed why this is needed here */
END;
/* At this point the entire collection
of donor distrubutions is in RAM. SAS should prominently offer
me a chance to confirm that they have been read (loaded)
correctly. FreelanceReinhard has sent me the code
I can used to retrieve what was read.
*/
SET temp3; /* This is the pop (observations) file. */
ADL_Categ2 = ADL_Categ;
/* We are imputing selected values for ADL_Categ;
but we leave the original values unchanged */
/* Retrieve the correct donor distribution for
this particular observation and draw random sample(s).
It is a sample of 1 when &n = 1 .
obj.FIND() searches for the match and when it is found
RAND(...) randomly choses 1 (Cat0), 2 (Cat1) or 3 (Cat2)
under control of the probabilities 'pCat0','pCat1','pCat2' .
*/
IF Age80Plus EQ 1 THEN DO;
IF obj.FIND()=0 THEN
DO sampno=1 to &n;
y=RAND('TABLE', OF pCat0 pCat1 pCat2);
IF y EQ 1 THEN ADL_Categ2 = 0;
IF y EQ 2 THEN ADL_Categ2 = 1 ;
IF y EQ 3 THEN ADL_Categ2 = 2;
OUTPUT;
END;
END;
RUN;
/* obj.FIND() is a function (method??)
that causes a search among lines of the donor distrution file
to find the one whose signature matches
the key declared above. If an eligible line on the PopFile
has no matching key on the DistribDonor file, SAS
jumps to the next PopFile line. You can use user-defined
tables later on to see which eligible lines failed to get
an imputation of ADL_Categ2.
*/
PROC FREQ ;
TABLES sampno sex ageg y ADL_Categ2 ;
RUN;
/* Do these distributions look decent?
Check this before going further.
It may be good to add ageg*ADL_Categ and ageg*ADL_Categ2
for more detailed checking . */
DATA temp4;
SET temp3;
IF Age80Plus EQ 0 ;
ADL_Categ2 = ADL_Categ;
RUN;
/* This operation cannot be done where the look-up
table is being used to find key matches. REM. -- records
with non-matches are simply ignored */
DATA temp5;
SET temp4 SimulOutput ;
RUN;
/* Here we concatenate the two datasets.
Rem. that in SimulOutput all persons are aged 80+,
and in temp4 there are no such persons.
*/
PROC FREQ DATA=temp5;
TABLES sex ageg mar ADL_Categ2
ageg*ADL_Categ
ageg*ADL_Categ2;
RUN;
/* Check that all these look decent. */
DATA sasfiles.Sim_ADLCateg_80_ONT ;
SET temp5;
RUN;
/* This dataset has all the simulations on board */
/* IMP: Breakdown of 80-plus to show
properly simulated details for 80-84, 85-89 and 90+ are
in this file for ageg, mar and ADL_Categ2 ONLY.
For all other tabulations, at ages 80-plus show ONE line for 80-plus,
which will be the survey data (not simulation output). */
Thanks for the link Reeza, and yes I am dealing with public use micro data files. My output involves 'patching' information into File A based on patterns in File B that involve networks of predictor variables that both files share (using hopefully reasonable assumptions and supported where feasible with relevant multivariate analysis), so there is quite a bit of 'synthetic stuff' in my records. I expect i will learn some useful ideas from the page to which you are sending me.
My goodness Reeza you have sent me to a gold mine! Just starting to read the piece I see the exact philosophy that is driving my work, and my situation is a bit easier in that most of my micro data file comprises values drawn from survey observations. I am going to enjoy learning the procedures covered in that paper. So thanks again!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.