About Season

Season · ‎10-18-2023

I would like to ask a further question about the residuals of linear PLS modeling. I noticed that SAS can produce residual plots upon request. Since PLS does not alter the linear nature of the model, does the requirement that the residuals of the model for each dependent variable satisfy the Gauss-Markov assumptions still apply?

Season · ‎10-18-2023

OK, I see. In fact, the "mean" of a categorical variable should be termed its "mathematical expectation", so it does exist, but not in the same manner as that of a continuous variable. I agree with your reply that the standard deviation of a categorical variable does not exist. I now adjust my previous reply: I find it strange to center dummy variables, as they are also discrete.

Season · ‎10-18-2023

Hello, Rick. Thank you for your prompt reply as well! I am also sincerely grateful for the time and effort you spent arranging the example code. I would like to ask a further question about the residuals of linear PLS modeling. I noticed that SAS can produce residual plots upon request. Since PLS does not alter the linear nature of the model, does the requirement that the residuals of the model for each dependent variable satisfy the Gauss-Markov assumptions still apply?

Season · ‎10-18-2023

Thank you, Paige, for your prompt answer! I have learnt from SAS Help about the automatic coding of dummy variables by the CLASS statement in various PLS procedures. As I stated, my question concerns the centering of the variables. It is strange to treat categorical variables as continuous and calculate their means and standard deviations in the very same way, but that seems to be the case in PLS. Of course, centering is not at all forbidden from a computational perspective, as we can treat categorical predictors as "continuous predictors taking a finite set of values" in this computation. I just thought it a bit strange, so I came here to see whether my understanding of PLS was correct. Anyway, thank you for your reply!

Season · ‎10-18-2023

Hello, I am building a partial least squares (PLS) model with categorical independent variables. I understand that the CLASS statement tells SAS which of the variables are categorical. But as the theory and SAS Help tell us, all of the variables are centered. Does that rule apply to categorical variables as well? That is, are categorical variables "centered" (via the formula (x - xbar)/std(x)) in the very same way as continuous variables are, without considering their categorical nature? Thank you!
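To make the question concrete, here is a minimal Python sketch (not SAS code) of what the formula (x - xbar)/std(x) does to a 0/1 dummy column. The column values and the function name `center_and_scale` are made up for illustration, and the use of the sample standard deviation (n - 1 denominator) is my assumption; I have not confirmed that the PLS procedure computes it exactly this way.

```python
import statistics


def center_and_scale(values):
    """Apply (x - mean) / std to every value, using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # n - 1 denominator (an assumption, see above)
    return [(x - mean) / std for x in values]


# Hypothetical dummy-coded column for one level of a CLASS variable.
dummy = [1, 0, 0, 1, 0]
scaled = center_and_scale(dummy)

# The column still takes only two distinct values after the transformation;
# it is simply shifted to mean 0 and rescaled to unit standard deviation.
print(sorted(set(round(v, 4) for v in scaled)))
```

The point of the sketch is that the arithmetic goes through without any problem: the dummy column remains a two-valued variable, only relocated and rescaled.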

Season · ‎10-15-2023

Thank you for your reminder! Actually, I did not help that person because I expected a reply from him/her. It seems that he/she did not know much about multiple imputation at the time, so I had not expected a fruitful reply from the original poster in the first place. Maybe his/her reply to my message would simply be "Thank you". Of course, things change in six years, so maybe the poster has become a master of multiple imputation by now. But it is OK if he/she never responds to my message. I replied because I felt happy in the course of doing so. I also feel that the problems he/she encountered are in fact commonplace, so other people may benefit from viewing my post. In addition, by replying to him/her, I would like to draw the attention of statisticians in this Community to the fact that frequently encountered problems in the realm of multiple imputation remain unsolved. Furthermore, such problems are also ubiquitous in resampling, where multiple samples are created. It would be my great honor if the questions I raised here eventually turned first into research projects for statisticians and then into solutions published in the literature. That would be a win-win situation for both of us.

Season · ‎10-15-2023

Wow! 😀 Thank you so much, Koen, for your wonderful reply! I never thought I would receive a solution to that problem! I will study the literature you referenced in depth. Thank you again for bearing my question in mind for such a long time!

Season · ‎10-15-2023

Thank you, Koen, for your reply! It seems that this problem is ubiquitous in resampling, where multiple samples are created. However, I have not yet found any research addressing it. I previously consulted a statistician at my institution, who responded that the misclassification error rates obtained in both manners can be reported together.

Season · ‎10-15-2023

Hello, Dave. I have spent the past month working on the very specific issue of pooling Wilcoxon test results, and I still found joining the conversation here fruitful. It suddenly dawned on me that the methods I mentioned may be too complicated and that a small modification of your method may be a good choice. Still, I have two issues regarding your code.

(1) Combine the sum of ranks or the z-statistic? In your code, the variable you pooled via PROC MIANALYZE was the sum of ranks, which may violate the rationale of Rubin's rules, since Rubin's rules are based on the asymptotic normal distribution of the estimate being pooled. In the Wilcoxon sum-of-rank test, it is the z-statistic rather than the sum of ranks that follows an asymptotic normal distribution. Therefore, we should pool the z-statistics instead.

(2) Potential necessity of specifying the EDF= option in PROC MIANALYZE. I wonder if you forgot to specify the EDF= option to override the infinite degrees of freedom that PROC MIANALYZE uses by default.

So, in conclusion, I think the most convenient way of pooling the results of Wilcoxon sum-of-rank tests is as follows: (1) obtain the z-statistic of each imputed sample; (2) pool them via PROC MIANALYZE; (3) obtain the results. The rationale is as follows. The z-statistic measures the departure from the null hypothesis in each sample, and a z-statistic of 0 corresponds to not rejecting the null hypothesis. Pooling the Wilcoxon test results therefore translates into a one-sample t-test problem: we have M sample values of a certain statistic (in this case the z-statistic) that follows an asymptotic normal distribution, and we would like to see whether the population mean of the statistic is 0. Pooling the z-statistics of the imputed samples is no different from pooling imputed sample means or standard deviations in multiple imputation, which can easily be done in PROC MIANALYZE.

The biggest challenge in doing so is to ascertain the standard error of each z-statistic, which PROC MIANALYZE requires. At first I had no idea how to compute it, given that we have only one z-statistic per sample, so it would be impossible to compute either the sample standard deviation or the sample standard error. But Licht's work enlightened me by pointing out that the z-statistics generated from Wilcoxon sum-of-rank tests essentially follow a standard normal distribution. When pooling the z-statistics, each sample yields only one z-statistic, so all of the population standard errors of the z-statistics are 1/sqrt(1)=1. That problem was solved! We can simply write code instructing SAS to add a variable whose values are all 1 and use it as the standard errors.

Now we finally discuss the EDF= issue. Admittedly, I have not read any literature introducing the concept of effective degrees of freedom in multiple imputation aside from that pertaining to SAS, and I found the explanation SAS Help provides still not that clear. So I also wonder about the exact definition of EDF and whether we should specify this option here. In my view, it is unnecessary to specify the EDF= option here, given that the EDF= option stands for the degrees of freedom of each of the statistics being combined. Since (1) the z-statistics follow a standard normal distribution, to which the concept of degrees of freedom does not apply, and (2) the t distribution is also asymptotically standard normal, perhaps we can regard each z-statistic as having infinite degrees of freedom, which is the default of PROC MIANALYZE. There is therefore no need to correct the effective degrees of freedom to a finite value.
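For readers who want to see the arithmetic behind this proposal, below is a minimal Python sketch (not SAS code, and not a replacement for PROC MIANALYZE) of Rubin's rules applied to M z-statistics with every standard error set to 1. The z-values and the function name `pool_z_statistics` are hypothetical, and the degrees-of-freedom formula is the classical one that assumes infinite complete-data degrees of freedom, which matches the default discussed above.

```python
import math
import statistics


def pool_z_statistics(z_values, se=1.0):
    """Pool z-statistics from M imputed samples with Rubin's rules.

    Every z-statistic is given the same standard error `se` (1 by default),
    as proposed in the post above.
    """
    m = len(z_values)
    estimate = statistics.mean(z_values)        # pooled z-statistic
    within = se ** 2                            # within-imputation variance W
    between = statistics.variance(z_values)     # between-imputation variance B
    total = within + (1 + 1 / m) * between      # total variance T
    t_stat = estimate / math.sqrt(total)
    if between == 0:
        df = math.inf
    else:
        # Classical Rubin degrees of freedom (infinite complete-data df assumed).
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return {"estimate": estimate, "total_variance": total, "t": t_stat, "df": df}


# Hypothetical z-statistics from M = 5 imputed samples.
result = pool_z_statistics([2.0, 2.5, 1.5, 2.2, 1.8])
print(result)
```

The pooled t statistic would then be referred to a t distribution with `df` degrees of freedom to obtain the P-value. Because the total variance is never smaller than 1, the pooled statistic is always closer to 0 than the plain average of the z-statistics.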

Season · ‎10-15-2023

Hello, I happened to run into your problem around a month ago. Admittedly, little research has paid attention to that issue. Dave's reply is one solution. There are also two approaches (three methods) for you to choose from.

Approach 1: Cited on page 149 of van Buuren's Flexible Imputation of Missing Data, Second Edition and in Table 2 of Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. The original work was done by Rubin. This method is also called the D2 method, named after the statistic it computes. Note that the D2 method is used to combine test statistics following a Chi-square distribution, and calculating the D2 statistic involves taking the square root of each test statistic. That is not applicable to a z-statistic, since it may well be negative (<0). So I think a potential way of using the D2 method to combine the z-statistics of Wilcoxon tests is: (1) square the z-statistic obtained by SAS to change the distribution of the test statistic from normal to Chi-square; (2) use the D2 method to pool the squared z-statistics; (3) obtain the P-value of the pooled result.

Approach 2: It should be noted that the z-statistic in fact comes from a normal approximation. The exact Wilcoxon sum-of-rank test itself yields only P-values, as is the case for Fisher's exact test of contingency tables. The exact Wilcoxon sum-of-rank test can be done in SAS with the EXACT statement (please do not forget to add the WILCOXON keyword to the EXACT statement to save computation time!). The second approach focuses on combining the P-values themselves rather than the z-statistics.

Method 1: Reported on page 220 of Little and Rubin's Statistical Analysis with Missing Data, 2nd Edition. Please note that this method applies only to one-sided tests.

Method 2: Reported in Licht, C. (2010). New methods for generating significance levels from multiply-imputed data. PhD thesis, University of Bamberg, Bamberg, Germany. Note that this method was also originally designed for one-sided tests. The author gave a way of tackling two-sided tests with the proposed method: split each two-sided test into two one-sided tests. Please also note that the two-sided P-value of the Wilcoxon sum-of-rank test is itself essentially the sum of two one-sided P-values. Details of exact Wilcoxon sum-of-rank tests can be found in SAS Help.

Please note that exact Wilcoxon sum-of-rank tests are extremely computer-intensive! I am running such a test with around 600 samples that were imputed 100 times on my workstation as I type. It takes around 24 hours to finish the test. In many cases, SAS failed to return an exact test result for lack of memory. Good luck!
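The three steps suggested under Approach 1 can be sketched in Python as follows (again, not SAS code). The formulas for D2 and its denominator degrees of freedom are written as they are commonly stated for the D2 method, so please check them against van Buuren's book before relying on this sketch. The z-values and the function name `d2_statistic` are hypothetical, and k = 1 because each squared z-statistic has one degree of freedom.

```python
import math
import statistics


def d2_statistic(chi_squares, k=1):
    """Combine M chi-square statistics (each with k df) via the D2 method.

    Returns the D2 statistic and its denominator degrees of freedom nu.
    """
    m = len(chi_squares)
    roots = [math.sqrt(d) for d in chi_squares]
    # Relative increase in variance, estimated from the square roots.
    r = (1 + 1 / m) * statistics.variance(roots)
    d_bar = statistics.mean(chi_squares)
    d2 = (d_bar / k - (m + 1) / (m - 1) * r) / (1 + r)
    nu = math.inf if r == 0 else k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2
    return d2, nu


# Step (1): square the hypothetical z-statistics from M = 5 imputed samples.
z_values = [2.0, 2.5, 1.5, 2.2, 1.8]
chi_squares = [z * z for z in z_values]

# Step (2): pool the squared z-statistics with the D2 method.
d2, nu = d2_statistic(chi_squares, k=1)
print(d2, nu)
```

Step (3), the P-value, is the upper tail probability of an F distribution with k and nu degrees of freedom evaluated at D2; it is left out here because the Python standard library has no F distribution. Note also that squaring discards the sign of each z-statistic, so imputed samples that disagree on the direction of the effect are treated as if they agreed.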

Season · ‎10-15-2023

Hello, there. I am also a medical data analyst who has worked on multiple imputation. First of all, the history of methods for dealing with missing data is rather short. One of the earliest endeavors dates back to the 1920s (some 100 years ago), but it was not until the late 1970s (1979) that extensive research on missing data began. The history of extensive research into missing data may be shorter than the age of some users of the SAS Community. (See van Buuren's Flexible Imputation of Missing Data, Second Edition for a more detailed description of the history of efforts to handle missing data.) To date, many of the problems regarding missing data remain unsolved. For instance, the theory of generalized additive models (GAMs) was first proposed in the 1980s, but it was not until 2017 (to the best of my knowledge) that the first article on handling missing data in GAMs was published.

Now I can answer some of your questions. I cannot answer all of them because some may remain unsolved, and perhaps you should browse the Internet to see whether someone other than me or anyone in this Community has given an answer to your question.

@Ujjawal wrote: Should missing value imputation and outlier treatment be done prior to splitting data into training and validation data sets?

The answer to the question regarding splitting prior to imputation is "yes". See the paper entitled The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data for details. Please note that this paper was published on 29th January 2015, months before you raised the question in this Community. So it is possible that people have been working on your problem but have not reached a solution that is viable enough to be published. This may be the case for outlier detection and treatment, which is also troubling me.

To date (15th October 2023), despite the presence of multiple statistics capable of detecting outliers in a single sample of a complete dataset (e.g., Cook's distance in linear regression), there seems to be no counterpart of these in the arena of multiple imputation (MI). I guess that one reason for the absence is that multiple samples are created in MI, which causes confusion as to whether the data analyst should use one of the imputed samples or the pooled imputed sample to calculate the statistics. So I guess that single imputation might be a potential choice for detecting outliers. But of course, my idea has not been validated, so you should browse the Internet to look for answers. Good luck!

Season · ‎10-15-2023

Hello, thanks for sharing your work with us! I have some simple yet possibly unanswered questions to raise concerning multiple imputation (MI). In short: how should we pool the point estimates of the respective imputed samples when the estimates do not (necessarily) follow a normal distribution? One of the most common problems of this kind encountered in data analysis is the pooling of medians. Suppose I have a variable that has missing data and does not follow a normal distribution; then the median is the correct statistic for describing the central tendency of the variable. Now that I have imputed M times and therefore have M medians, how should I pool them, given that the mean follows an asymptotic normal distribution while this is not necessarily the case for the median, and that Rubin's rules for pooling estimates are based on asymptotic normality? Variable transformation is a possible choice. That is, we bypass the problem by transforming the variable into one that follows a normal distribution, via methods like the Box-Cox transformation, and reporting the pooled mean and standard deviation of the transformed rather than the original variable. But given the complexity of the Box-Cox transformation for practitioners without professional statistical training (e.g., medical doctors) and the loss of "intuitiveness" and "explainability" (e.g., it is difficult to tell what the mean and standard deviation of Box-Cox transformed triglyceride are all about, especially for medical doctors), this method does not seem to work. Similar situations frequently occur in medical statistics. Sensitivity, specificity and the Youden index are all examples of problems of this kind. So, when pooling the estimates seems to violate the rationale of Rubin's rules, what should we do? Many thanks!

Season · ‎10-15-2023

Thank you, Dave, for your reply!

Season · ‎09-03-2023

I am building and validating a logistic regression model. I wish to calculate the confidence interval of the area under the ROC curve upon validation in an independent population. However, I found no way of doing so, whether I use the SCORE statement of PROC LOGISTIC or PROC PLM. What should I do? Many thanks to you all!

Season · ‎04-16-2023

Hello, there. It has been many years since you raised the question. I wonder if that problem has been solved by now. I am also building a logistic regression model and have been working on the issue of misclassification error for some time. SAS Help gives a formula for the misclassification rate (the formula itself is not reproduced in this post). Simply speaking, according to this formula, the misclassification rate is the proportion of observations that were misclassified. SAS Help states that an observation is classified into the level with the largest probability. For instance, suppose the dependent variable has two levels, 0 and 1. If the posterior probability P(Y=1) of an observation is not smaller than its posterior probability P(Y=0), then SAS deems that, according to the model, the event coded by Y=1 happens to the observation; otherwise, the event coded by Y=0 happens.
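The classification rule described above can be illustrated with a short Python sketch (not SAS code). It covers only the simple unweighted case: any frequency weights or prior adjustments that the SAS formula may include are ignored here. The probabilities, the observed responses, and the function name `misclassification_rate` are made up for illustration.

```python
def misclassification_rate(p_event, observed):
    """Proportion of observations whose predicted level differs from the observed one.

    p_event  : posterior probabilities P(Y=1) for each observation
    observed : observed responses coded 0 or 1
    """
    # Classify into Y=1 when P(Y=1) is not smaller than P(Y=0) = 1 - P(Y=1).
    predicted = [1 if p >= 1 - p else 0 for p in p_event]
    errors = sum(1 for y_hat, y in zip(predicted, observed) if y_hat != y)
    return errors / len(observed)


# Hypothetical posterior probabilities and observed responses.
p_event = [0.9, 0.4, 0.5, 0.2, 0.7]
observed = [1, 1, 1, 0, 0]
rate = misclassification_rate(p_event, observed)
print(rate)
```

In this made-up example the second and fifth observations are misclassified, so two of the five observations count toward the rate. The third observation has P(Y=1) = P(Y=0) = 0.5 and is assigned to Y=1, in line with the "not smaller than" rule.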


