Drawing a random sample from arbitrary distributions when the probabil...

LeStatisticien · Posted 01-25-2020 10:17 PM

Help please. I wish to do a simulation using an arbitrary distribution over the possible values of a categorical variable. The TABLE facility along with RAND seem just what I need; but I have a collection of such distributions. The program must decide which is the correct distribution in this collection to use, based upon the respondent's attributes on a few variables.

I have dug hard in our SAS help literature but I have not found an example where the p(*) can change from one respondent to the next. I would be overjoyed to learn what steps would achieve my goal.

Thanks in advance.

PS. The way that I have done this outside of SAS is to attach a signature (a list of indicator values) to each distribution (the collection resides in an external data file). My program then computes the signature from a respondent's record, and when this signature matches that of one of the collection of distributions the program will use that distribution to proceed with the simulation. Now I want to stop doing this work outside of SAS.

FreelanceReinh · Posted 01-26-2020 08:43 AM

Hello @LeStatisticien,

So, you have a dataset with respondent IDs and several variables, say x1, x2, x3 for example, which determine the distribution to draw a random sample from. The distribution is one of a categorical variable with, say 4 values (1, 2, 3, 4) and hence determined by the probabilities p1, p2, p3 for the first three values (then p4=1-p1-p2-p3).

/* Create test data */

data have;
input id x1-x3;
cards;
1 3 1 4
2 5 9 2
3 8 0 3
;

Option 1: Compute the probabilities from x1, x2, x3 if you know a formula. (I use arbitrary formulas below, just for demonstration.)

%let n=100; /* sample size per ID */

data want;
call streaminit(27182818);
set have;
/* compute (arbitrary) probabilities */
p1=1/(3+min(of x:));
p2=1/(3+max(of x:));
p3=1/(3+mean(of x:));
/* draw random samples of size n */
do sampno=1 to &n;
  y=rand('table', of p1-p3);
  output;
end;
run;

Option 2: Predetermine the probabilities for each "signature" and store them in a SAS dataset. Retrieve them by means of a hash object. (I compute an arbitrary signature below.)

data distr;
input sig p1 p2 p3;
cards;
144      .1 .2 .5
15000    .3 .4 .2
32000    .4 .2 .3
43740    .6 .1 .1
48600    .2 .1 .3
15746400 .2 .3 .2
;

%let n=100; /* sample size per ID */

data want(keep=id sampno y);
call streaminit(27182818);
if _n_=1 then do;
  /* load distribution for each signature into hash table */
  dcl hash h(dataset:'distr');
  h.definekey('sig');
  h.definedata('p1','p2','p3');
  h.definedone();
  if 0 then set distr;
end;
set have;
/* compute signature */
sig=2**x1*3**x2*5**x3;
/* retrieve probabilities and draw random samples */
if h.find()=0 then do sampno=1 to &n;
  y=rand('table', of p1-p3);
  output;
end;
run;

Dataset distr does not need to be sorted.

Reeza · Posted 01-26-2020 01:25 PM

Public Use Micro Files?

Reeza · Posted 01-26-2020 01:27 PM

If you are generating synthetic data this may be helpful.

https://support.sas.com/resources/papers/proceedings17/1224-2017.pdf

Hope it helps.

LeStatisticien · Posted 01-26-2020 04:00 PM

Wonderful FreelanceReinhard! I have used Option1 in the R language and am avoiding it in SAS because the situation does not allow a formula to be built; but even before trying out your Option 2 I am fascinated with the idea. So I am going to "dig in here" and if I get stuck some place I will ask for more help. In any event, at the end to the road, doing all this inside SAS is a huge development since in my simulation work I often am at a point where the micro data file has several variables with simulated values here and there and it is embedded in a SAS dataset.

Cheers!

LeStatisticien · Posted 01-27-2020 12:51 PM

Hello FreelanceReinhard! Your code (adapted in the direction of my eventual application) ran the first time with no error message. So thanks again, and the code is below. I have a few education questions following the code.

/* Create test data */

DATA observations;

INPUT id x1 x2 x3;

CARDS;

1 3 1 4

2 5 9 2

3 8 0 3

;

RUN;

/* donor distributions for the simulation */

DATA AgeDonorDistribs;

INPUT sig p1 p2 p3 ;

CARDS;

144 .1 .2 .5

15000 .3 .4 .2

32000 .4 .2 .3

43740 .6 .1 .1

48600 .2 .1 .3

15746400 .2 .3 .2

;

/* sig, the ID of an imputation donor distribution,

can also be a vector of categorical variables

that represent respondent attributes -

see https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf

and

https://www.linkedin.com/pulse/sas-hash-tables-patrick-cuba

N.B. In a real donor distribution we require p1+p2+p3 = 1 */

%let n=1; /* sample size per ID */

/* n=1 is normal in population simulation, except

where we are setting up sesitivity tests */

DATA SimulOutput(KEEP=id sampno y);

CALL STREAMINIT(27182818);

IF _N_=1 THEN DO;

DECLARE HASH obj(DATASET:'AgeDonorDistribs');

obj.DEFINEKEY('sig');

obj.DEFINEDATA('p1','p2','p3');

obj.DEFINEDONE();

IF 0 THEN SET AgeDonorDistribs ;

/* What on earth does this do??? */

END;

SET observations;

/* compute the signature of the observation now in RAM */

sig=2**x1*3**x2*5**x3;

/* retrieve the correct donor distribution for

this particular observation and draw random sample(s).

It is a sample of 1 when &n = 1 */

IF obj.FIND()=0 THEN

DO sampno=1 to &n;

y=RAND('TABLE', OF p1 p2 p3);

OUTPUT;

END;

RUN;

Questions:

(1) Once the collection of simulation-donor distributions is loaded into RAM from an external file, is there some way for me to instructs SAS to bring the table up to the screen so that I can confirm that the loading has happened correctly?

(2) I expect exhaust to issue a syntax error message AFTER DO here: " IF obj.FIND()=0 THEN DO sampno=1 to &n;". Why did not do so?

(3) Help me here please: what exactly is this expression doing ? IF obj.FIND()=0 . On the surface, it looks as if an actions to be taken up a condition does not exist (see =0); what I see in the output WORK file that appropriate action seems to have been taken.

(4) What error message will I get when the look-up table signatures contain no match for the computed signature of the just loaded record? (My first question is important here because the failure to match could be caused by a reading error as the look-up table was being brought into RAM).

(5) In "IF 0 THEN SET AgeDonorDistribs ;", what is being evaluated at zero, and why do we point to the donor distributions when the DECLARE JHASH ... has already done that?

Thanks in advance. Your answers will help me to intelligently make big changes in the details for my eventual applications.

LeStatisticien · Posted 01-27-2020 12:56 PM

Apologies for all the "typos"; especially "exhaust" which should be SAS.

FreelanceReinh · Posted 01-27-2020 03:28 PM

Glad to read that my code worked for you. Please find below the answers to your questions.

One possible way: Using the OUTPUT method of the SAS hash object you can write the data items of the hash table to a SAS dataset. However, by default, the hash table is not sorted. Its internal sort order is optimized for data retrieval. You can enforce a sort order, though, by specifying the ORDERED argument tag in the declaration of the hash object. To facilitate your check, you will probably want to add the key item (in your case: sig) to the list of data items so that it will be part of the output dataset (arbitrarily named check_ht below), which you could then directly compare to dataset AgeDonorDistribs (with PROC COMPARE), assuming that this is sorted by sig as well. The necessary modifications to your code to prepare this check are highlighted in red below:
```
 IF _N_=1 THEN DO;
    DECLARE HASH obj(DATASET:'AgeDonorDistribs', ordered:'a');
    obj.DEFINEKEY('sig');
    obj.DEFINEDATA('sig','p1','p2','p3');
    obj.DEFINEDONE();
 
    IF 0 THEN SET AgeDonorDistribs ;
    obj.output(dataset:'check_ht');
 END;
```
Alternatively (and requiring more additional code), you could retrieve all values from the hash object (key value by key value) and write them to the log or output window. I don't think it's worth the effort, but if you ever observe a significant discrepancy between check_ht and AgeDonorDistribs, please create a new thread about your finding in this forum.
The FIND method returns 0 if the key value (here: the value of variable sig computed immediately before the IF/THEN statement) is found in the hash table. Otherwise the return code (which could be written to a numeric variable rc) would be a nonzero integer. So the code is similar to "if rc=0 then do ...", which is usual DATA step syntax. The macro variable reference &n resolves to 1 (in your example), thus making the DO loop specification complete. Why would you expect a syntax error? As long as &n=1, you could simplify the code and omit variable sampno (whose constant value 1 would be redundant).
See answer to question 2 above. I understand that the "=0" seems counterintuitive if you're new to the SAS hash object, but that's how the SAS developers defined the return codes of the FIND method.
Currently, no error message would occur. Just the DO loop would not be executed. But, of course, you can add an ELSE branch to the IF/THEN statement and let SAS (in case of a failed search) write a red error message to the log, terminate the DATA step prematurely or do anything else what you deem appropriate.
That SET statement is never executed because the IF condition 0 is (obviously) not met. Its purpose is that during compilation the variables of dataset AgeDonorDistribs are brought into the program data vector (PDV) and that they don't seem to be uninitialized to the compiler. In this particular case it's only avoiding the unnecessary notes "NOTE: Variable p1 is uninitialized." etc. in the log. Again, it's a somewhat counterintuitive characteristic of the SAS hash object syntax that the reference to AgeDonorDistribs in the DATASET argument tag in the DECLARE statement is (in general) not sufficient for these purposes.

@LeStatisticien wrote:

Your answers will help me to intelligently make big changes in the details for my eventual applications.

Becoming familiar with the SAS hash object is a much greater endeavor than, say, learning a new function. But it pays off (at least it did for me). Two decent books have been written about this topic (SAS® Hash Object Programming Made Easy and Data Management Solutions Using SAS® Hash Table...). Good luck with the "big changes" you're planning.

LeStatisticien · Posted 01-28-2020 04:02 AM

Thanks for excellent help, FreelanceReinhard. I will study your points carefully, and benefit from them as I move the code to the 'next level' for my real-world applications. The two references you passed along are gratefully received, and will be used repeatedly in the months ahead. I welcome this new (to me - old to you) and massive improvement in the SAS functionality.

LeStatisticien · Posted 01-30-2020 03:19 PM

This post is a sort of tutorial for those doing selective simulations inside population micro data files, where multiple attributes of a respondent determine the probability distribution used to control a simulation for that person. The underlying strategy is as ‘old as the hills’; but I have not seen an illustration that involves SAS Hash (look-up) tables.

My file involves simulations for three variables, and they are stacked sets of operations. For each set, I used to move the population file outside SAS, do the work and then bring it back. Now, thanks to FreelanceReinhard and the discussion at https://support.sas.com/resources/papers/proceedings15/3024-2015.pdf, all three steps are done inside SAS, saving a ton of time. So this is huge!

The code below deals with the last set of operations only. The experts here will be annoyed at the needless comments; but this post is for people like me who have used SAS for decades and did not know that doing this sort of simulation work was feasible without leaving SAS. So thanks again FreelanceReinhard.

/* =========== START IMPUTE ADL_Categ HERE */

DATA temp3;

SET sasfiles.simulmar80_ont ;

RUN;

...

DATA ADLCategDonorDistribs;

INFILE '/folders/myfolders/ADL_Categ DonorDist.csv' DELIMITER=',' DSD ;

INPUT sex ageg pCat0 pCat1 pCat2 ;

/* signature == sex ageg; proportions== pCat0 pCat1 pCat2 */

RUN;

/* A donor file line looks like this: 2,18,0.635780988,0.332386646,0.031832366 .

Both the name and the coding for "sex" and "ageg" must be IDENTICAL

between this file and the population file */

/* CAUTION ===== run this segment only when you want to reset the seed. */

DATA _NULL_;

CALL STREAMINIT(0); /* generate seed from system clock */

X = RAND("UNIFORM");

RUN;

%PUT &=SYSRANDOM;

/*

See: https://blogs.sas.com/content/iml/2017/06/01/choose-seed-random-number.html

*/

/* ===== IMPORTANT: this is where we set the number of

imputations that will be generated (for a given variable)

for each respondent. We can use a number greater than 1 at n=1 below

to bootstrap a confidence interval for the imputed value. */

%let n=1; /* sample size per simulation (one record one variable) declared as a macro variable */

QUIT;

/* The simulation is done in this step */

DATA SimulOutput ;

CALL STREAMINIT(313777059);

/* Here we set the seed for the random number drawing, */

/* Now we load into RAM a collection of donor distributions, one for

each unique respondent signature. We load it into a "hash" (or look-up) table.

_N_=1 means the load is executed as soon as the

first observation is brought into RAM */

IF _N_=1 THEN DO;

DECLARE HASH obj(DATASET:'ADLCategDonorDistribs');

obj.DEFINEKEY('sex','ageg');

obj.DEFINEDATA('pCat0','pCat1','pCat2');

obj.DEFINEDONE();

IF 0 THEN SET ADLCategDonorDistribs ;

/* FreelanceReinhard says to leave this alone.

The SAS coders have not revealed why this is needed here */

END;

/* At this point the entire collection

of donor distrubutions is in RAM. SAS should prominently offer

me a chance to confirm that they have been read (loaded)

correctly. FreelanceReinhard has sent me the code

I can used to retrieve what was read.

*/

SET temp3; /* This is the pop (observations) file. */

ADL_Categ2 = ADL_Categ;

/* We are imputing selected values for ADL_Categ;

but we leave the original values unchanged */

/* Retrieve the correct donor distribution for

this particular observation and draw random sample(s).

It is a sample of 1 when &n = 1 .

obj.FIND() searches for the match and when it is found

RAND(...) randomly choses 1 (Cat0), 2 (Cat1) or 3 (Cat2)

under control of the probabilities 'pCat0','pCat1','pCat2' .

*/

IF Age80Plus EQ 1 THEN DO;

IF obj.FIND()=0 THEN

DO sampno=1 to &n;

y=RAND('TABLE', OF pCat0 pCat1 pCat2);

IF y EQ 1 THEN ADL_Categ2 = 0;

IF y EQ 2 THEN ADL_Categ2 = 1 ;

IF y EQ 3 THEN ADL_Categ2 = 2;

OUTPUT;

END;

RUN;

/* obj.FIND() is a function (method??)

that causes a search among lines of the donor distrution file

to find the one whose signature matches

the key declared above. If an eligible line on the PopFile

has no matching key on the DistribDonor file, SAS

jumps to the next PopFile line. You can use user-defined

tables later on to see which eligible lines failed to get

an imputation of ADL_Categ2.

*/

PROC FREQ ;

TABLES sampno sex ageg y ADL_Categ2 ;

RUN;

/* Do these distributions look decent?

Check this before going further.

It may be good to add ageg*ADL_Categ and ageg*ADL_Categ2

for more detailed checking . */

DATA temp4;

SET temp3;

IF Age80Plus EQ 0 ;

ADL_Categ2 = ADL_Categ;

RUN;

/* This operation cannot be done where the look-up

table is being used to find key matches. REM. -- records

with non-matches are simply ignored */

DATA temp5;

SET temp4 SimulOutput ;

RUN;

/* Here we concatenate the two datasets.

Rem. that in SimulOutput all persons are aged 80+,

and in temp4 there are no such persons.

*/

PROC FREQ DATA=temp5;

TABLES sex ageg mar ADL_Categ2

ageg*ADL_Categ

ageg*ADL_Categ2;

RUN;

/* Check that all these look decent. */

DATA sasfiles.Sim_ADLCateg_80_ONT ;

SET temp5;

RUN;

/* This dataset has all the simulations on board */

/* IMP: Breakdown of 80-plus to show

properly simulated details for 80-84, 85-89 and 90+ are

in this file for ageg, mar and ADL_Categ2 ONLY.

For all other tabulations, at ages 80-plus show ONE line for 80-plus,

which will be the survey data (not simulation output). */

LeStatisticien · Posted 01-26-2020 04:09 PM

Thanks for the link Reeza, and yes I am dealing with public use micro data files. My output involves 'patching' information into File A based on patterns in File B that involve networks of predictor variables that both files share (using hopefully reasonable assumptions and supported where feasible with relevant multivariate analysis), so there is quite a bit of 'synthetic stuff' in my records. I expect i will learn some useful ideas from the page to which you are sending me.

LeStatisticien · Posted 01-26-2020 04:19 PM

My goodness Reeza you have sent me to a gold mine! Just starting to read the piece I see the exact philosophy that is driving my work, and my situation is a bit easier in that most of my micro data file comprises values drawn from survey observations. I am going to enjoy learning the procedures covered in that paper. So thanks again!

Drawing a random sample from arbitrary distributions when the probabilities vary among respondents

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Re: Drawing a random sample from arbitrary distributions when the probabilities vary among responden

Registration is open

Registration is open

SAS Training: Just a Click Away