Solved: Re: Simulating multinomial logistic regression

MetinBulus · Posted 09-16-2015 01:48 AM

Anyone know how to design a simulation study with a categorical outcome standing for treatment status (T), and a mixture of normal predictor (X), where X is correlated with the T?

I need to be able to manipulate sample size (approximately at least, using p1, p2, p3 in the code below), mean of X by T (meanN1, meanN2, meaN3 in the code below; and that's why I use mixture normal, perhaps this technique could be wrong), correlation of X with each T levels (beta11, beta12, beta13 in the code below).

I am not sure if I have done what I want in the following code. When I change the total sample size proprotions of treatment levels significantly differ from predetermined probabilities. In addition result sometimes have four treatment levels or two treatment levels. This code doesn't work. Please help.

*NOTE: The following data generating code might be written more efficiently in IML;
option symbolgen;
%let p1 = 2/10;	*probability of T=1;
%let p2 = 3/10;	*probability of T=2;
%let p3 = 5/10;	*probability of T=3;
%let N = 300;	*total obs;
%let meanN1 = -0.2;	*mean of normal covariate X in T=1;
%let meanN2 = 0;	*mean of normal covariate X in T=2;
%let meanN3 = 0.2;	*mean of normal covariate X in T=3;
%let beta11 = log(7);	*relationship of covariate X with T=1;
%let beta12 = log(5);	*relationship of covariate X with T=2;
%let beta13 = log(3);	*relationship of covariate X with T=3;
data Sim (keep=x t) ;
	array x123{&N} _temporary_;
	call STREAMINIT(1982);
	do i=1 to &p1*&N;
		x123{i} = RAND("NORMAL", &meanN1, 1);
		x = x123{i};
		*find beta01 given p1, beta11, meanN1
		 such that mean of all individual mu1 approximate p1;
		eta11 = log(&p1/(1-&p1));
		beta01= eta11 - &beta11*&meanN1;
		*plug in beta01 to obtain individual mu1;
		eta1 = beta01 + &beta11 * x;
		mu1 = exp(eta1) / (1+exp(eta1));	
	end;
	do i=&p1*&N+1 to &p1*&N+&p2*&N;
		x123{i} = RAND("NORMAL", &meanN2, 1);
		x = x123{i};
		*find beta02 given p2, beta12, meanN2
		such that mean of all individual mu2 approximate p2;
		eta22 = log(&p2/(1-&p2));
		beta02= eta22 - &beta12*&meanN1;
		*plug in beta02 to obtain individual mu2;
		eta2 = beta02 + &beta12 * x;
		mu2 = exp(eta2) / (1+exp(eta2));	
	end;
	do i=&p1*&N+&p2*&N+1 to &N;
		x123{i} = RAND("NORMAL", &meanN3, 1);
		x = x123{i};
		*find beta03 given p3, beta13, meanN3
		 such that mean of all individual mu3 approximate p3;
		eta33 = log(&p3/(1-&p3));
		beta03= eta33 - &beta13*&meanN3;
		*plug in beta03 to obtain individual mu3;
		eta3 = beta03 + &beta13 * x;
		mu3 = exp(eta3) / (1+exp(eta3));	
	end;
	*generate the treatment status (1,2,3) given individual probabilities;
	do i=1 to &N;
		x = x123{i};
		t = RAND("TABLE", mu1, mu2, mu3);
		output;
	end;
run;

proc freq data=Sim;
	table t;
run;

Rick_SAS · Posted 09-16-2015 11:07 AM

It looks like you are off to a good start, since your program has a lot of the basic components. If you have the book Simulating Data with SAS, there are several sections that might be useful, including section 12.2 ("Generalized Linear Regression Models") and section 8.2 ("The Multinomial Distribution"). If you are doing complicated simulations like this, I suspect you will find the tips and techniques in the book to be very useful.

From your description, I assume that T is the response variable that you want to generate from explanatory variables X. One question you can ask yourself is "What parameters should I specify in this simulation." You seem to be choosing p1-p3 and the betas. However, in the following paragraph I suggest that the mu_i's are the natural parameter to choose.

Read the section of the PROC GENMOD documentation that describes multinomial models. It provides the mathematical structure that you need to duplicate in your simulation. For example, it says to generate your (possibly correlated) data matrix X, choose regression coefficients beta, and intercept terms mu1, mu2, mu3 that depend on the categories. You then construct the linear model eta_i = mu_i + X`*beta, and apply the inverse link function (which you need to specify...logit?). This will produce three probabilities (one for each mu_i). These are the probabilities that you plug into the RANDMULTINOMIAL function in SAS/IML, or you can use Base SAS to simulate the random multinomial values.

Notice in the previous paragraph that you don't get to choose p1-p3. They result from the other parameters, such as the mu_i's and betas and X matrix..

Depending on what you want to do with this simulated data, the previous algorithm might suffice. If not, then you will have to tackle the harder "inverse problem". Namely, if you insist on specifying p1-p3, then you have to solve for the mu_i's and other parameters that correspond to those target values in the population. I haven't thought much about the multinomial case, but in general this can be hard. If you decide to pursue this, then I recommend reading Chapter 9 "Advanced Simulation of Multivariate Data", which addresses these issues.

View solution in original post

Rick_SAS · Posted 09-16-2015 11:07 AM

It looks like you are off to a good start, since your program has a lot of the basic components. If you have the book Simulating Data with SAS, there are several sections that might be useful, including section 12.2 ("Generalized Linear Regression Models") and section 8.2 ("The Multinomial Distribution"). If you are doing complicated simulations like this, I suspect you will find the tips and techniques in the book to be very useful.

From your description, I assume that T is the response variable that you want to generate from explanatory variables X. One question you can ask yourself is "What parameters should I specify in this simulation." You seem to be choosing p1-p3 and the betas. However, in the following paragraph I suggest that the mu_i's are the natural parameter to choose.

Read the section of the PROC GENMOD documentation that describes multinomial models. It provides the mathematical structure that you need to duplicate in your simulation. For example, it says to generate your (possibly correlated) data matrix X, choose regression coefficients beta, and intercept terms mu1, mu2, mu3 that depend on the categories. You then construct the linear model eta_i = mu_i + X`*beta, and apply the inverse link function (which you need to specify...logit?). This will produce three probabilities (one for each mu_i). These are the probabilities that you plug into the RANDMULTINOMIAL function in SAS/IML, or you can use Base SAS to simulate the random multinomial values.

Notice in the previous paragraph that you don't get to choose p1-p3. They result from the other parameters, such as the mu_i's and betas and X matrix..

Depending on what you want to do with this simulated data, the previous algorithm might suffice. If not, then you will have to tackle the harder "inverse problem". Namely, if you insist on specifying p1-p3, then you have to solve for the mu_i's and other parameters that correspond to those target values in the population. I haven't thought much about the multinomial case, but in general this can be hard. If you decide to pursue this, then I recommend reading Chapter 9 "Advanced Simulation of Multivariate Data", which addresses these issues.

MetinBulus · Posted 09-16-2015 02:26 PM

Thank you Rick! I have refined the simulation code such that probabilities sum to one and RAND("TABLE", mu1, mu2, mu3) consistently generates three treatment levels. Next challenge is to create a system of equations to find beta01, beta02, and beta03 which would in turn help to create balanced or unbalanced design by a predetermined level.

option symbolgen;
%let N = 300;	*total obs;
%let beta11 = log(7);	*relationship of covariate X with T=1;
%let beta12 = log(5);	*relationship of covariate X with T=2;
%let beta13 = &beta11 - &beta12;	*relationship of covariate X with T=3;

data Sim;
	array x123{&N} _temporary_;
	call STREAMINIT(1982);
	do i=1 to &N;
		x123{i} = RAND("NORMAL", 0, 1);
		x = x123{i};

		**Section 1. define linear equations;
		eta12 = beta03 + &beta13 * x;
		eta23 = beta02 + &beta12 * x;
		eta13 = beta01 + &beta11 * x;

		**Section 2. define probabilities for each treatment group
		as a function of linear equations in section 1;
		mu1 = exp(eta13) / (1 + exp(eta13) + exp(eta23));
		mu2 = exp(eta23) / (1 + exp(eta13) + exp(eta23));
		mu3 = 1 / (1 + exp(eta13) + exp(eta23));
		
		** check if probabilities sum to 1;
		mut = mu1 +mu2 +mu3;

		** Section 3. simulate treatment status for each individual
		given their probabilities in section 2;
		t = RAND("TABLE", mu1, mu2, mu3);
		output;
	end;
	
run;