BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
luboxing
Obsidian | Level 7

Hi, I am trying to simulating data with the following joint pmf:

 

 

X

                   

Y

1

2

3

4

5

6

7

8

9

10

Grand Total

1

5.43%

2.33%

0.39%

1.55%

0.00%

1.16%

0.00%

0.39%

0.00%

0.00%

11.24%

2

1.94%

4.65%

1.16%

0.39%

1.94%

0.00%

0.78%

0.39%

0.00%

0.00%

11.24%

3

2.33%

1.55%

4.26%

0.39%

1.94%

0.78%

0.00%

0.00%

0.00%

0.00%

11.24%

4

0.78%

0.78%

1.16%

3.10%

0.78%

2.33%

1.55%

0.78%

0.00%

0.00%

11.24%

5

0.39%

0.39%

0.78%

2.71%

3.49%

0.39%

1.55%

1.16%

0.39%

0.00%

11.24%

6

0.39%

0.39%

1.16%

1.16%

0.78%

2.33%

1.55%

2.33%

0.78%

0.39%

11.24%

7

0.00%

0.00%

0.78%

0.39%

0.39%

2.33%

3.49%

2.33%

1.16%

0.39%

11.24%

8

0.00%

0.78%

1.16%

1.16%

1.55%

0.78%

1.16%

3.88%

0.39%

0.39%

11.24%

9

0.00%

0.00%

0.00%

0.39%

0.39%

0.39%

0.39%

0.00%

3.88%

1.16%

6.59%

10

0.00%

0.39%

0.39%

0.00%

0.00%

0.78%

0.78%

0.00%

0.00%

1.16%

3.49%

Grand Total

11.24%

11.24%

11.24%

11.24%

11.24%

11.24%

11.24%

11.24%

6.59%

3.49%

100.00%

 

One sample of distribution is:

  X                    
Y 1 2 3 4 5 6 7 8 9 10 Grand Total
1   1                 1
2 1                   1
3     1               1
4       1             1
5             1       1
6         1           1
7           1         1
8               1     1
9                 1   1
10                   1 1
Grand Total 1 1 1 1 1 1 1 1 1 1 10

 

Notice that each checked cell (1) occupies a distinct row and column. 

 

Is there a SAS proc or IML functions that simulates from this distribution?

 

Thanks in advance!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @luboxing,

 

Sorry, either I don't understand your specifications or they are inconsistent. Your first 10x10 table indeed specifies a joint probability mass function (pmf) for the discrete random variables X and Y, which could be simulated, e.g., with the RAND 'Table' function:

%let n=1000000; /* sample size */

data sim(drop=i z);
call streaminit(27182818);
array p[10,10] _temporary_ (14  6  1 4 0 3 0  1  0 0
                             5 12  3 1 5 0 2  1  0 0
                             6  4 11 1 5 2 0  0  0 0
                             2  2  3 8 2 6 4  2  0 0
                             1  1  2 7 9 1 4  3  1 0
                             1  1  3 3 2 6 4  6  2 1
                             0  0  2 1 1 6 9  6  3 1
                             0  2  3 3 4 2 3 10  1 1
                             0  0  0 1 1 1 1  0 10 3
                             0  1  1 0 0 2 2  0  0 3);
do x=1 to 10;
  do y=1 to 10;
    p[y,x]=p[y,x]/258;
  end;
end;
do i=1 to &n;
  z=rand('table', of p[*]);
  x=mod(z-1,10)+1;
  y=(z-x)/10+1;
  output;
end;
run;

(Assuming your percentages are rounded from values of the form n/258, n=0, 1, 2, ... .)

 

Thereafter you indicate that you don't really want to simulate individual pairs (y, x), but rather 10x10 permutation matrices. Of course, if you specify a distribution of such matrices, you can derive from that a distribution of pairs (y, x) by adding the probabilities of those matrices whose (y, x) element is 1 and dividing each of these sums by 10 (each matrix consists of 10 pairs). However, such a derived distribution of pairs must have constant marginal probabilities of 0.1, not 0.1124, 0.0659, etc. like your joint pmf. To see this, look at an arbitrary row or column, say, row 5. If you simulate one million 10x10 permutation matrices, exactly one million of the resulting ten million (y, x) pairs will have y=5 because each matrix has exactly one entry with y=5 -- regardless of the probability distribution of the permutation matrices. So, the marginal probability of y=5 must be 1/10 (and the same for each of the other rows and for each column).

 

Without further specifications, a 10x10 table with the probabilities of the (y, x) pairs would not uniquely determine a probability distribution of 10x10 permutation matrices because such a distribution has fact(10)-1=3,628,799 parameters, whereas the 10x10 table with fixed marginal sums provides merely (10-1)**2=81.

View solution in original post

8 REPLIES 8
sbxkoenk
SAS Super FREQ

Hello,

 

I have moved your post to "Statistical Procedures" board.

 

Your question does not seem very difficult , but I have no time to make a program right now.

 

Here's how to do it in Python. You can proceed the same way in SAS.

Sampling a joint probability mass function
https://www.youtube.com/watch?v=swvslAWviKw

 

Search the internet for :

simulating joint (bivariate) probability mass function     or rather

sampling from joint (bivariate) probability mass function.

 

BR,

Koen

luboxing
Obsidian | Level 7

Koen,

 

Thanks for replying! Since each column and row can only be drawn once, the YouTube solution wouldn't work because it does not exclude the rows and columns already drawn. In this example, I will need draw 10 times and come up with such series as (1,1), (2,2), (3,3), .... (10,10), but the YouTube solution possibly gives (1,1), (2,2), (3, 2) ... which repeats row 2 twice.

 

Instead, I can still use the YouTube code with a modification: in each iteration I reset the cell already drawn to 0 and then recalculate the densities of all other cells (marginal probabilities). This way, with 10 draws I will get the sample. However, this process is still a bit cumbersome. Just wondering if SAS offers something easier.

 

Bo

ballardw
Super User

Maybe this?

 

data sim;
   array perc (10) _temporary_(0.1124,0.1124,0.1124,0.1124,0.1124,0.1124,0.1124,0.1124,0.0659,0.0349);
   array yt (10) _temporary_;
   do x=1 to 10;
      y=.;
      do until (y>0);
        yc=rand('table',of perc(*));
        if whichn(yc,of yt(*))= 0 then do;
           yt[yc]=yc;
           y =yc;
        end; 
      end;
      output;
   end;
   keep x y;
run;

As I understand it you expect to create 10 output pairs of X and Y that exhausts both dimensions in exactly 10 trials. That means, if I understand, that we can actually select one of the values, I picked X, as 1 to 10 then we randomly pick the Y. This uses the marginal probabilities (did check they add to close to 100% or 1, if not you must make sure they do) and placed those values into a temporary array for ease of reference. The Rand ('table') distribution takes a number of parameters that should total to 1 as the "percent" of values that should be 1, 2, 3 .... n. So rand(table, .1, .2, .3, .4) would have a 10% of a 1, 20% of a 2, 30% of 3 and 40% of a 4. IF your  values do not total to 1 then any shortage < 1 creates a left over value.  Example: Rand('table',.1,.2,.3,.35) would have a 35% chance of a 4 and a 05% chance of returning 5.

 

After picking X there is a loop that creates random Y values. If the value of Y has not been previously selected by checking if the value has been set in the array YT using the WHICHN function which looks for value in a list. If present then loop and pull another. If it is not in the array YT then set the value into the array so it is not reused, and the loop is satisfied and written to the output.

 

The arrays make it easier to provide the marginal probabilities in an easy to reference form and to store the values as identified. By using _temporary_ arrays then the values aren't written to the output.

 

luboxing
Obsidian | Level 7

Thanks Ballardw! I sampled from your code for 100 times with macro and here is the resulting distribution (percent*100). It is quite different from the original distribution. 

 

Table of y by x
y x
1 2 3 4 5 6 7 8 9 10 Total
1
1.78
1.68
0.79
0.50
1.98
0.69
0.59
0.69
0.40
0.89
10.00
2
1.09
0.79
1.49
1.68
1.29
0.79
0.59
0.99
0.99
0.30
10.00
3
0.99
0.89
0.99
1.09
0.59
1.19
1.39
0.99
0.99
0.89
10.00
4
1.09
1.49
0.89
0.79
0.99
0.99
1.39
1.29
0.59
0.50
10.00
5
0.99
0.79
1.58
0.89
1.09
0.69
1.09
0.99
1.29
0.59
10.00
6
1.09
1.09
0.79
1.78
0.69
1.39
1.29
0.79
0.89
0.20
10.00
7
1.49
1.39
1.19
0.99
1.29
1.09
0.50
0.79
0.99
0.30
10.00
8
0.99
0.89
1.29
0.99
0.79
1.39
0.89
1.29
0.89
0.59
10.00
9
0.40
0.89
0.79
1.09
0.50
0.69
1.09
1.19
1.58
1.78
10.00
10
0.10
0.10
0.20
0.20
0.79
1.09
1.19
0.99
1.39
3.96
10.00
Total
 
10.00
 
10.00
 
10.00
 
10.00
 
10.00
 
10.00
 
10.00
 
10.00
 
10.00
 
10.00
 
100.00
ballardw
Super User

You would have to show how you sampled from that proposed approach to make any strong comments about that result.

 

I understood that you were taking 10 samples at a time with the restriction that within those 10 no X or Y could repeat. Is that correct or not?

 

It took about 10,000 iterations of just the sampling from the marginal distribution, i.e. that Rand('table') to get visibly close to the marginal percentages. So sample size of 100 is very likely extremely low.

  One result:

y Frequency Percent Cumulative
Frequency
Cumulative
Percent
1 1118 11.18 1118 11.18
2 1102 11.02 2220 22.20
3 1116 11.16 3336 33.36
4 1148 11.48 4484 44.84
5 1124 11.24 5608 56.08
6 1101 11.01 6709 67.09
7 1173 11.73 7882 78.82
8 1103 11.03 8985 89.85
9 651 6.51 9636 96.36
10 364 3.64 10000 100.00

 

FreelanceReinh
Jade | Level 19

Hi @luboxing,

 

Sorry, either I don't understand your specifications or they are inconsistent. Your first 10x10 table indeed specifies a joint probability mass function (pmf) for the discrete random variables X and Y, which could be simulated, e.g., with the RAND 'Table' function:

%let n=1000000; /* sample size */

data sim(drop=i z);
call streaminit(27182818);
array p[10,10] _temporary_ (14  6  1 4 0 3 0  1  0 0
                             5 12  3 1 5 0 2  1  0 0
                             6  4 11 1 5 2 0  0  0 0
                             2  2  3 8 2 6 4  2  0 0
                             1  1  2 7 9 1 4  3  1 0
                             1  1  3 3 2 6 4  6  2 1
                             0  0  2 1 1 6 9  6  3 1
                             0  2  3 3 4 2 3 10  1 1
                             0  0  0 1 1 1 1  0 10 3
                             0  1  1 0 0 2 2  0  0 3);
do x=1 to 10;
  do y=1 to 10;
    p[y,x]=p[y,x]/258;
  end;
end;
do i=1 to &n;
  z=rand('table', of p[*]);
  x=mod(z-1,10)+1;
  y=(z-x)/10+1;
  output;
end;
run;

(Assuming your percentages are rounded from values of the form n/258, n=0, 1, 2, ... .)

 

Thereafter you indicate that you don't really want to simulate individual pairs (y, x), but rather 10x10 permutation matrices. Of course, if you specify a distribution of such matrices, you can derive from that a distribution of pairs (y, x) by adding the probabilities of those matrices whose (y, x) element is 1 and dividing each of these sums by 10 (each matrix consists of 10 pairs). However, such a derived distribution of pairs must have constant marginal probabilities of 0.1, not 0.1124, 0.0659, etc. like your joint pmf. To see this, look at an arbitrary row or column, say, row 5. If you simulate one million 10x10 permutation matrices, exactly one million of the resulting ten million (y, x) pairs will have y=5 because each matrix has exactly one entry with y=5 -- regardless of the probability distribution of the permutation matrices. So, the marginal probability of y=5 must be 1/10 (and the same for each of the other rows and for each column).

 

Without further specifications, a 10x10 table with the probabilities of the (y, x) pairs would not uniquely determine a probability distribution of 10x10 permutation matrices because such a distribution has fact(10)-1=3,628,799 parameters, whereas the 10x10 table with fixed marginal sums provides merely (10-1)**2=81.

luboxing
Obsidian | Level 7

Thanks FreelanceReinh! Yes your code is exactly what I need... I was aware the "table" but of only its vector use. Yes I am trying to simulate the matrix and here is the frequency result from your code. As you can see the marginal probabilities are consistent with my first post! Many thanks!

 

Table of y by x
y x
1 2 3 4 5 6 7 8 9 10 Total
1
5.44
2.33
0.39
1.54
0.00
1.16
0.00
0.39
0.00
0.00
11.24
2
1.95
4.66
1.15
0.38
1.94
0.00
0.78
0.40
0.00
0.00
11.26
3
2.29
1.55
4.26
0.39
1.94
0.77
0.00
0.00
0.00
0.00
11.21
4
0.78
0.77
1.15
3.12
0.75
2.36
1.58
0.76
0.00
0.00
11.25
5
0.39
0.38
0.78
2.67
3.50
0.40
1.55
1.18
0.38
0.00
11.24
6
0.39
0.39
1.16
1.16
0.76
2.31
1.54
2.34
0.77
0.39
11.21
7
0.00
0.00
0.77
0.38
0.38
2.31
3.50
2.35
1.16
0.39
11.23
8
0.00
0.76
1.15
1.17
1.55
0.79
1.14
3.90
0.39
0.39
11.24
9
0.00
0.00
0.00
0.39
0.39
0.39
0.37
0.00
3.90
1.17
6.61
10
0.00
0.39
0.39
0.00
0.00
0.78
0.78
0.00
0.00
1.17
3.51
Total
112473
11.25
112322
11.23
111930
11.19
112112
11.21
112032
11.20
112597
11.26
112363
11.24
113066
11.31
65931
6.59
35174
3.52
1000000
100.00
Ksharp
Super User
You could use hypergeometric distribution to simulate this kind of data.
@Rick_SAS wrote a couple of wonderful blog about this :

https://blogs.sas.com/content/iml/2015/10/21/simulate-contingency-tables-fixed-sums-sas.html
https://blogs.sas.com/content/iml/2015/10/02/balls-and-urns2.html

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 1133 views
  • 0 likes
  • 5 in conversation