Statistical Procedures

luboxing · Posted 06-21-2023 02:11 PM

Hi, I am trying to simulating data with the following joint pmf:

	X
Y	1	2	3	4	5	6	7	8	9	10	Grand Total
1	5.43%	2.33%	0.39%	1.55%	0.00%	1.16%	0.00%	0.39%	0.00%	0.00%	11.24%
2	1.94%	4.65%	1.16%	0.39%	1.94%	0.00%	0.78%	0.39%	0.00%	0.00%	11.24%
3	2.33%	1.55%	4.26%	0.39%	1.94%	0.78%	0.00%	0.00%	0.00%	0.00%	11.24%
4	0.78%	0.78%	1.16%	3.10%	0.78%	2.33%	1.55%	0.78%	0.00%	0.00%	11.24%
5	0.39%	0.39%	0.78%	2.71%	3.49%	0.39%	1.55%	1.16%	0.39%	0.00%	11.24%
6	0.39%	0.39%	1.16%	1.16%	0.78%	2.33%	1.55%	2.33%	0.78%	0.39%	11.24%
7	0.00%	0.00%	0.78%	0.39%	0.39%	2.33%	3.49%	2.33%	1.16%	0.39%	11.24%
8	0.00%	0.78%	1.16%	1.16%	1.55%	0.78%	1.16%	3.88%	0.39%	0.39%	11.24%
9	0.00%	0.00%	0.00%	0.39%	0.39%	0.39%	0.39%	0.00%	3.88%	1.16%	6.59%
10	0.00%	0.39%	0.39%	0.00%	0.00%	0.78%	0.78%	0.00%	0.00%	1.16%	3.49%
Grand Total	11.24%	11.24%	11.24%	11.24%	11.24%	11.24%	11.24%	11.24%	6.59%	3.49%	100.00%

One sample of distribution is:

	X
Y	1	2	3	4	5	6	7	8	9	10	Grand Total
1		1									1
2	1										1
3			1								1
4				1							1
5							1				1
6					1						1
7						1					1
8								1			1
9									1		1
10										1	1
Grand Total	1	1	1	1	1	1	1	1	1	1	10

Notice that each checked cell (1) occupies a distinct row and column.

Is there a SAS proc or IML functions that simulates from this distribution?

Thanks in advance!

FreelanceReinh · Posted 06-21-2023 06:27 PM

Hi @luboxing,

Sorry, either I don't understand your specifications or they are inconsistent. Your first 10x10 table indeed specifies a joint probability mass function (pmf) for the discrete random variables X and Y, which could be simulated, e.g., with the RAND 'Table' function:

%let n=1000000; /* sample size */

data sim(drop=i z);
call streaminit(27182818);
array p[10,10] _temporary_ (14  6  1 4 0 3 0  1  0 0
                             5 12  3 1 5 0 2  1  0 0
                             6  4 11 1 5 2 0  0  0 0
                             2  2  3 8 2 6 4  2  0 0
                             1  1  2 7 9 1 4  3  1 0
                             1  1  3 3 2 6 4  6  2 1
                             0  0  2 1 1 6 9  6  3 1
                             0  2  3 3 4 2 3 10  1 1
                             0  0  0 1 1 1 1  0 10 3
                             0  1  1 0 0 2 2  0  0 3);
do x=1 to 10;
  do y=1 to 10;
    p[y,x]=p[y,x]/258;
  end;
end;
do i=1 to &n;
  z=rand('table', of p[*]);
  x=mod(z-1,10)+1;
  y=(z-x)/10+1;
  output;
end;
run;

(Assuming your percentages are rounded from values of the form n/258, n=0, 1, 2, ... .)

Thereafter you indicate that you don't really want to simulate individual pairs (y, x), but rather 10x10 permutation matrices. Of course, if you specify a distribution of such matrices, you can derive from that a distribution of pairs (y, x) by adding the probabilities of those matrices whose (y, x) element is 1 and dividing each of these sums by 10 (each matrix consists of 10 pairs). However, such a derived distribution of pairs must have constant marginal probabilities of 0.1, not 0.1124, 0.0659, etc. like your joint pmf. To see this, look at an arbitrary row or column, say, row 5. If you simulate one million 10x10 permutation matrices, exactly one million of the resulting ten million (y, x) pairs will have y=5 because each matrix has exactly one entry with y=5 -- regardless of the probability distribution of the permutation matrices. So, the marginal probability of y=5 must be 1/10 (and the same for each of the other rows and for each column).

Without further specifications, a 10x10 table with the probabilities of the (y, x) pairs would not uniquely determine a probability distribution of 10x10 permutation matrices because such a distribution has fact(10)-1=3,628,799 parameters, whereas the 10x10 table with fixed marginal sums provides merely (10-1)**2=81.

View solution in original post

sbxkoenk · Posted 06-21-2023 02:53 PM

Hello,

I have moved your post to "Statistical Procedures" board.

Your question does not seem very difficult , but I have no time to make a program right now.

Here's how to do it in Python. You can proceed the same way in SAS.

Sampling a joint probability mass function
https://www.youtube.com/watch?v=swvslAWviKw

Search the internet for :

simulating joint (bivariate) probability mass function or rather

sampling from joint (bivariate) probability mass function.

BR,

Koen

luboxing · Posted 06-21-2023 03:47 PM

Koen,

Thanks for replying! Since each column and row can only be drawn once, the YouTube solution wouldn't work because it does not exclude the rows and columns already drawn. In this example, I will need draw 10 times and come up with such series as (1,1), (2,2), (3,3), .... (10,10), but the YouTube solution possibly gives (1,1), (2,2), (3, 2) ... which repeats row 2 twice.

Instead, I can still use the YouTube code with a modification: in each iteration I reset the cell already drawn to 0 and then recalculate the densities of all other cells (marginal probabilities). This way, with 10 draws I will get the sample. However, this process is still a bit cumbersome. Just wondering if SAS offers something easier.

Bo

ballardw · Posted 06-21-2023 05:21 PM

Maybe this?

data sim;
   array perc (10) _temporary_(0.1124,0.1124,0.1124,0.1124,0.1124,0.1124,0.1124,0.1124,0.0659,0.0349);
   array yt (10) _temporary_;
   do x=1 to 10;
      y=.;
      do until (y>0);
        yc=rand('table',of perc(*));
        if whichn(yc,of yt(*))= 0 then do;
           yt[yc]=yc;
           y =yc;
        end; 
      end;
      output;
   end;
   keep x y;
run;

As I understand it you expect to create 10 output pairs of X and Y that exhausts both dimensions in exactly 10 trials. That means, if I understand, that we can actually select one of the values, I picked X, as 1 to 10 then we randomly pick the Y. This uses the marginal probabilities (did check they add to close to 100% or 1, if not you must make sure they do) and placed those values into a temporary array for ease of reference. The Rand ('table') distribution takes a number of parameters that should total to 1 as the "percent" of values that should be 1, 2, 3 .... n. So rand(table, .1, .2, .3, .4) would have a 10% of a 1, 20% of a 2, 30% of 3 and 40% of a 4. IF your values do not total to 1 then any shortage < 1 creates a left over value. Example: Rand('table',.1,.2,.3,.35) would have a 35% chance of a 4 and a 05% chance of returning 5.

After picking X there is a loop that creates random Y values. If the value of Y has not been previously selected by checking if the value has been set in the array YT using the WHICHN function which looks for value in a list. If present then loop and pull another. If it is not in the array YT then set the value into the array so it is not reused, and the loop is satisfied and written to the output.

The arrays make it easier to provide the marginal probabilities in an easy to reference form and to store the values as identified. By using _temporary_ arrays then the values aren't written to the output.

luboxing · Posted 06-21-2023 05:53 PM

Thanks Ballardw! I sampled from your code for 100 times with macro and here is the resulting distribution (percent*100). It is quite different from the original distribution.

Table of y by x
y	x
y	1	2	3	4	5	6	7	8	9	10	Total
1	1.78	1.68	0.79	0.50	1.98	0.69	0.59	0.69	0.40	0.89	10.00
2	1.09	0.79	1.49	1.68	1.29	0.79	0.59	0.99	0.99	0.30	10.00
3	0.99	0.89	0.99	1.09	0.59	1.19	1.39	0.99	0.99	0.89	10.00
4	1.09	1.49	0.89	0.79	0.99	0.99	1.39	1.29	0.59	0.50	10.00
5	0.99	0.79	1.58	0.89	1.09	0.69	1.09	0.99	1.29	0.59	10.00
6	1.09	1.09	0.79	1.78	0.69	1.39	1.29	0.79	0.89	0.20	10.00
7	1.49	1.39	1.19	0.99	1.29	1.09	0.50	0.79	0.99	0.30	10.00
8	0.99	0.89	1.29	0.99	0.79	1.39	0.89	1.29	0.89	0.59	10.00
9	0.40	0.89	0.79	1.09	0.50	0.69	1.09	1.19	1.58	1.78	10.00
10	0.10	0.10	0.20	0.20	0.79	1.09	1.19	0.99	1.39	3.96	10.00
Total	10.00	10.00	10.00	10.00	10.00	10.00	10.00	10.00	10.00	10.00	100.00

ballardw · Posted 06-21-2023 06:51 PM

You would have to show how you sampled from that proposed approach to make any strong comments about that result.

I understood that you were taking 10 samples at a time with the restriction that within those 10 no X or Y could repeat. Is that correct or not?

It took about 10,000 iterations of just the sampling from the marginal distribution, i.e. that Rand('table') to get visibly close to the marginal percentages. So sample size of 100 is very likely extremely low.

One result:

y	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	1118	11.18	1118	11.18
2	1102	11.02	2220	22.20
3	1116	11.16	3336	33.36
4	1148	11.48	4484	44.84
5	1124	11.24	5608	56.08
6	1101	11.01	6709	67.09
7	1173	11.73	7882	78.82
8	1103	11.03	8985	89.85
9	651	6.51	9636	96.36
10	364	3.64	10000	100.00

FreelanceReinh · Posted 06-21-2023 06:27 PM

Hi @luboxing,

Sorry, either I don't understand your specifications or they are inconsistent. Your first 10x10 table indeed specifies a joint probability mass function (pmf) for the discrete random variables X and Y, which could be simulated, e.g., with the RAND 'Table' function:

%let n=1000000; /* sample size */

data sim(drop=i z);
call streaminit(27182818);
array p[10,10] _temporary_ (14  6  1 4 0 3 0  1  0 0
                             5 12  3 1 5 0 2  1  0 0
                             6  4 11 1 5 2 0  0  0 0
                             2  2  3 8 2 6 4  2  0 0
                             1  1  2 7 9 1 4  3  1 0
                             1  1  3 3 2 6 4  6  2 1
                             0  0  2 1 1 6 9  6  3 1
                             0  2  3 3 4 2 3 10  1 1
                             0  0  0 1 1 1 1  0 10 3
                             0  1  1 0 0 2 2  0  0 3);
do x=1 to 10;
  do y=1 to 10;
    p[y,x]=p[y,x]/258;
  end;
end;
do i=1 to &n;
  z=rand('table', of p[*]);
  x=mod(z-1,10)+1;
  y=(z-x)/10+1;
  output;
end;
run;

(Assuming your percentages are rounded from values of the form n/258, n=0, 1, 2, ... .)

Thereafter you indicate that you don't really want to simulate individual pairs (y, x), but rather 10x10 permutation matrices. Of course, if you specify a distribution of such matrices, you can derive from that a distribution of pairs (y, x) by adding the probabilities of those matrices whose (y, x) element is 1 and dividing each of these sums by 10 (each matrix consists of 10 pairs). However, such a derived distribution of pairs must have constant marginal probabilities of 0.1, not 0.1124, 0.0659, etc. like your joint pmf. To see this, look at an arbitrary row or column, say, row 5. If you simulate one million 10x10 permutation matrices, exactly one million of the resulting ten million (y, x) pairs will have y=5 because each matrix has exactly one entry with y=5 -- regardless of the probability distribution of the permutation matrices. So, the marginal probability of y=5 must be 1/10 (and the same for each of the other rows and for each column).

Without further specifications, a 10x10 table with the probabilities of the (y, x) pairs would not uniquely determine a probability distribution of 10x10 permutation matrices because such a distribution has fact(10)-1=3,628,799 parameters, whereas the 10x10 table with fixed marginal sums provides merely (10-1)**2=81.

luboxing · Posted 06-21-2023 06:43 PM

Thanks FreelanceReinh! Yes your code is exactly what I need... I was aware the "table" but of only its vector use. Yes I am trying to simulate the matrix and here is the frequency result from your code. As you can see the marginal probabilities are consistent with my first post! Many thanks!

Table of y by x
y	x
y	1	2	3	4	5	6	7	8	9	10	Total
1	5.44	2.33	0.39	1.54	0.00	1.16	0.00	0.39	0.00	0.00	11.24
2	1.95	4.66	1.15	0.38	1.94	0.00	0.78	0.40	0.00	0.00	11.26
3	2.29	1.55	4.26	0.39	1.94	0.77	0.00	0.00	0.00	0.00	11.21
4	0.78	0.77	1.15	3.12	0.75	2.36	1.58	0.76	0.00	0.00	11.25
5	0.39	0.38	0.78	2.67	3.50	0.40	1.55	1.18	0.38	0.00	11.24
6	0.39	0.39	1.16	1.16	0.76	2.31	1.54	2.34	0.77	0.39	11.21
7	0.00	0.00	0.77	0.38	0.38	2.31	3.50	2.35	1.16	0.39	11.23
8	0.00	0.76	1.15	1.17	1.55	0.79	1.14	3.90	0.39	0.39	11.24
9	0.00	0.00	0.00	0.39	0.39	0.39	0.37	0.00	3.90	1.17	6.61
10	0.00	0.39	0.39	0.00	0.00	0.78	0.78	0.00	0.00	1.17	3.51
Total	112473 11.25	112322 11.23	111930 11.19	112112 11.21	112032 11.20	112597 11.26	112363 11.24	113066 11.31	65931 6.59	35174 3.52	1000000 100.00

Ksharp · Posted 06-22-2023 09:03 AM

You could use hypergeometric distribution to simulate this kind of data.
@Rick_SAS wrote a couple of wonderful blog about this :

https://blogs.sas.com/content/iml/2015/10/21/simulate-contingency-tables-fixed-sums-sas.html
https://blogs.sas.com/content/iml/2015/10/02/balls-and-urns2.html

Statistical Procedures

Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Re: Simulating from joint pmf

Follow Us

What is...

Statistical Procedures

Our biggest data and AI event of the year.

Follow Us

What is...