@JanetXu wrote:
Q1. After imputation, we plan to calculate the endpoint for each imputed dataset. Is this correct? Can we stack all 10 datasets and then calculate the endpoint?
Of course you can stack the datasets, but the correct way of handling missing data via multiple imputation (MI) is to calculate the statistic separately in each imputed dataset and then combine (pool) the results in some way.
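To make the workflow concrete, here is a minimal Python sketch (toy data, with a simple group mean difference standing in for your endpoint): the statistic is computed separately in each imputed dataset, never on the stacked data.

```python
# Hypothetical illustration of the per-dataset analysis step of MI.
# Ten imputed datasets; each is a list of (group, value) records (made up).
from statistics import mean

imputed_datasets = [
    [("A", 1.0 + 0.1 * m), ("A", 2.0), ("B", 3.0), ("B", 4.0 - 0.1 * m)]
    for m in range(10)
]

def endpoint(dataset):
    """Stand-in endpoint: mean difference between groups B and A."""
    a = [v for g, v in dataset if g == "A"]
    b = [v for g, v in dataset if g == "B"]
    return mean(b) - mean(a)

# One estimate per imputed dataset; these are what get pooled later.
estimates = [endpoint(d) for d in imputed_datasets]
print(estimates)
```

The pooling step then operates on the list of per-dataset estimates, not on one statistic computed from the stacked data.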
@JanetXu wrote:
Q2. Assume we calculate the endpoint for each imputed dataset and then run the Wilcoxon rank-sum test. We will have 10 p-values and 10 corresponding z-values, etc. How should we combine them to get one pooled p-value? How should we make inferences based on the 10 imputed datasets?
Thanks.
Janet
I explained the ways of doing so in my previous replies.
@JanetXu wrote:
Third, some 'exact' methods are still at the 'research' stage at this moment. But for Rubin's rules: from the Wilcoxon rank-sum test output, which variable should I put into PROC MIANALYZE: 'z', S, the sum of scores, or the sum of scores minus the expected sum? I have not thought it through yet. Any suggestions? Thanks again
To be precise, it is not the exact methods for the Wilcoxon rank-sum test that are still in development, but rather the huge field of pooling point estimates of statistics from the individual MI-imputed datasets.
As I explained in previous replies, both z-statistics and p-values can be pooled, but in entirely different ways. There seems to be no research comparing the validity of the two approaches, but from a time-saving perspective, you can pool the z-statistics as long as the assumption of asymptotic normality is reasonable. There is no exact sample-size cutoff at which the assumption of asymptotic normality can be deemed to hold; a rule of thumb is 30. That is, if your sample size is larger than 30, you can resort to pooling the z-statistics rather than the p-values.
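For illustration, here is the arithmetic of pooling z-statistics via Rubin's rules, sketched in Python rather than SAS (the z values are made up); this is essentially what feeding 'z' with a standard error of 1 into PROC MIANALYZE computes.

```python
# Rubin's rules applied to per-dataset z-statistics (hypothetical values).
import math

z_stats = [2.10, 1.85, 2.40, 1.95, 2.20, 2.05, 1.70, 2.30, 2.15, 1.90]
m = len(z_stats)

q_bar = sum(z_stats) / m                     # pooled point estimate
u_bar = 1.0                                  # within variance: z ~ N(0, 1)
b = sum((z - q_bar) ** 2 for z in z_stats) / (m - 1)   # between variance
t_var = u_bar + (1 + 1 / m) * b              # total variance
t_stat = q_bar / math.sqrt(t_var)            # referred to a t distribution
df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2    # Rubin's df

# With df this large, the t distribution is close to standard normal,
# so a normal two-sided p-value is a serviceable approximation here.
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))
print(round(t_stat, 3), round(df, 1), round(p_approx, 4))
```

Note that the within-imputation variance is fixed at 1 because each z-statistic is already standardized; only the between-imputation variance has to be estimated.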
@JanetXu wrote:
4) I strongly believe that 'z' from the Wilcoxon rank-sum test follows the standard normal distribution well: z ~ Normal(0, 1), and as you wrote, sigma is just 1. If we had many imputed datasets (say 100), I have thought of running PROC UNIVARIATE to see whether 'z' follows a standard normal.
Of course you can conduct a normality test to see whether the z-statistics of your samples follow a normal distribution, but I don't think it is necessary.
@JanetXu wrote:
5) I have tried this method: putting 'z' with a standard error of 1 into PROC MIANALYZE on my data. Below is where I cannot completely agree with you.
In the output, the estimate of 'z' is, as everyone knows, just the simple arithmetic mean. There is a 't for H0: Parameter = Theta0'; under it, the value is fairly close to the estimate of 'z', and there is a p-value, Pr > |t|. So I sense that this p-value assumes the average of 'z' follows a non-central t distribution with non-centrality parameter Theta0 under H0? If my understanding is correct, then I doubt this p-value is the 'pooled' p-value we want, because what we want is a 'best' z that follows a normal distribution, and our pooled p-value should come from that 'best' z via the normal distribution directly. I would think simply using the average 'z' to get a p-value from the normal distribution is a reasonable solution.
You have noticed something I also noticed when I first delved into the field of multiple imputation. It is common in the missing-data field that the distribution of the parameters to be pooled differs from that of the pooled parameter. Consider combining the regression coefficients of a logistic regression: each coefficient to be combined follows an asymptotically normal distribution, yet it is a t-test that ultimately decides whether the pooled population coefficient is 0, since the pooled coefficient follows a t rather than a normal distribution. You can safely conclude that all of the pooled parameters in PROC MIANALYZE follow a t distribution, regardless of the distribution of the original parameters being pooled.
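The regression-coefficient case can be sketched the same way in Python (coefficients and standard errors are made up). Unlike the z-statistic case, the within-imputation variance is no longer fixed at 1 but is the average squared standard error, and the pooled estimate is referred to a t distribution.

```python
# Rubin's rules for a regression coefficient: each imputed dataset
# supplies an estimate and its standard error (hypothetical numbers).
import math

coefs = [0.52, 0.48, 0.60, 0.55, 0.45]
ses   = [0.20, 0.22, 0.19, 0.21, 0.20]
m = len(coefs)

q_bar = sum(coefs) / m                                   # pooled coefficient
u_bar = sum(se ** 2 for se in ses) / m                   # within variance
b = sum((q - q_bar) ** 2 for q in coefs) / (m - 1)       # between variance
t_var = u_bar + (1 + 1 / m) * b                          # total variance
t_stat = q_bar / math.sqrt(t_var)                        # compared to t, not normal
df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2      # Rubin's df
print(round(q_bar, 3), round(t_stat, 3), round(df, 1))
```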
But please note that PROC MIANALYZE is not universal in dealing with missing data, so it is not true that pooled parameters across the entire field of MI all follow a t distribution. The D2 method I mentioned is an example: the parameters to be pooled follow a chi-square distribution, yet the pooled parameter follows an F distribution.
You may find the change of distribution in the course of pooling odd (that is what I thought when I was learning it), but that is indeed the case.
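For reference, here is a Python sketch of the D2 combining rule (as I understand it from Li, Meng, Raghunathan and Rubin's work): m chi-square statistics with k degrees of freedom each are pooled into one F-statistic. The chi-square values below are made up.

```python
# D2 pooling of chi-square statistics (hypothetical values).
import math

chi2 = [6.1, 7.4, 5.8, 6.9, 6.5]   # one chi-square statistic per dataset
k = 2                               # df of each chi-square test
m = len(chi2)

d_bar = sum(chi2) / m
sqrt_d = [math.sqrt(d) for d in chi2]
s_mean = sum(sqrt_d) / m
# relative increase in variance: sample variance of the square roots,
# inflated by (1 + 1/m)
r = (1 + 1 / m) * sum((s - s_mean) ** 2 for s in sqrt_d) / (m - 1)

d2 = (d_bar / k - (m + 1) / (m - 1) * r) / (1 + r)   # pooled F-statistic
df2 = k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2     # denominator df
print(round(d2, 3), round(df2, 1))                    # numerator df is k
```

The pooled D2 is then compared against an F(k, df2) distribution, even though each input statistic was chi-square.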
@JanetXu wrote:
6) From #5 above, it goes back to my initial thinking in my question. I am trying to get a 'pooled' statistic (later I thought 'z' can be used directly, same as your thought): a 'pooled sum of scores' from each dataset, a new expected sum of scores, a pooled standard deviation under H0, etc. The idea is not mature.
I don't think the sums of ranks (I don't quite understand what the word "score" in the phrase "pooled sum of score" refers to) can be pooled directly, because they don't follow an asymptotic standard normal distribution. Rather, the z-statistic, a transformed sum of ranks, does follow a standard normal distribution given a reasonable sample size.
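The transformation in question is just a standardization of the rank sum. A minimal Python sketch with toy data (using the no-ties formulas for the mean and variance of the rank sum under H0):

```python
# Standardizing the Wilcoxon rank sum W into an asymptotically N(0, 1)
# z-statistic (no-ties formulas; toy data).
import math

group1 = [1.2, 3.4, 5.6, 7.8]
group2 = [2.1, 4.3, 6.5, 8.7, 9.9]
n1, n2 = len(group1), len(group2)

pooled = sorted(group1 + group2)
ranks = {v: i + 1 for i, v in enumerate(pooled)}   # ranks 1..n1+n2

w = sum(ranks[v] for v in group1)                  # rank sum of group 1
e_w = n1 * (n1 + n2 + 1) / 2                       # E[W] under H0
var_w = n1 * n2 * (n1 + n2 + 1) / 12               # Var[W] under H0
z = (w - e_w) / math.sqrt(var_w)
print(w, round(z, 4))
```

It is this z, not the raw W, whose pooled average can sensibly be treated as (approximately) standard normal, since W's mean and variance depend on the group sizes.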