@JanetXu wrote:
Q1. After imputation, we plan to calculate the endpoint for each imputed dataset. Is this correct? Can we stack all 10 datasets and then calculate the endpoint?
Of course you can stack the datasets, but the correct way of handling missing data via multiple imputation (MI) is to calculate the statistic separately in each imputed dataset and then combine (pool) the results in some way.
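To make the workflow concrete, here is a minimal Python sketch (toy data, with a simple group mean difference standing in for your endpoint): the statistic is computed separately in each imputed dataset, never on the stacked data.

```python
# Hypothetical illustration of the per-dataset analysis step of MI.
# Ten imputed datasets; each is a list of (group, value) records (made up).
from statistics import mean

imputed_datasets = [
    [("A", 1.0 + 0.1 * m), ("A", 2.0), ("B", 3.0), ("B", 4.0 - 0.1 * m)]
    for m in range(10)
]

def endpoint(dataset):
    """Stand-in endpoint: mean difference between groups B and A."""
    a = [v for g, v in dataset if g == "A"]
    b = [v for g, v in dataset if g == "B"]
    return mean(b) - mean(a)

# One estimate per imputed dataset; these are what get pooled later.
estimates = [endpoint(d) for d in imputed_datasets]
print(estimates)
```

The pooling step then operates on the list of per-dataset estimates, not on one statistic computed from the stacked data.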
@JanetXu wrote:
Q2. Assume we calculate the endpoint for each imputed dataset and then run the Wilcoxon rank-sum test. We will have 10 p-values and 10 corresponding z-values, etc. How should we combine them to get one pooled p-value? How should we make inferences based on the 10 imputed datasets?
Thanks.
Janet
I explained the ways of doing so in my previous replies.
@JanetXu wrote:
Third, some 'exact' methods are still at the 'research' stage at this moment. But for Rubin's rules: from the Wilcoxon rank-sum test output, which variable should I put into PROC MIANALYZE: 'z', S, the sum of scores, or the sum of scores minus the expected sum? I have not thought it through yet. Any suggestions? Thanks again
To be precise, it is not the exact methods for the Wilcoxon rank-sum test that are still in development, but rather the huge field of pooling point estimates of statistics from the individual MI-imputed datasets.
As I explained in previous replies, both z-statistics and p-values can be pooled, but in entirely different ways. There seems to be no research comparing the validity of the two approaches, but from a time-saving perspective, you can pool the z-statistics as long as the assumption of asymptotic normality is reasonable. There is no exact sample-size cutoff at which the assumption of asymptotic normality can be deemed to hold; a rule of thumb is 30. That is, if your sample size is larger than 30, you can resort to pooling the z-statistics rather than the p-values.
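For illustration, here is the arithmetic of pooling z-statistics via Rubin's rules, sketched in Python rather than SAS (the z values are made up); this is essentially what feeding 'z' with a standard error of 1 into PROC MIANALYZE computes.

```python
# Rubin's rules applied to per-dataset z-statistics (hypothetical values).
import math

z_stats = [2.10, 1.85, 2.40, 1.95, 2.20, 2.05, 1.70, 2.30, 2.15, 1.90]
m = len(z_stats)

q_bar = sum(z_stats) / m                     # pooled point estimate
u_bar = 1.0                                  # within variance: z ~ N(0, 1)
b = sum((z - q_bar) ** 2 for z in z_stats) / (m - 1)   # between variance
t_var = u_bar + (1 + 1 / m) * b              # total variance
t_stat = q_bar / math.sqrt(t_var)            # referred to a t distribution
df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2    # Rubin's df

# With df this large, the t distribution is close to standard normal,
# so a normal two-sided p-value is a serviceable approximation here.
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))
print(round(t_stat, 3), round(df, 1), round(p_approx, 4))
```

Note that the within-imputation variance is fixed at 1 because each z-statistic is already standardized; only the between-imputation variance has to be estimated.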
@JanetXu wrote:
4) I strongly believe that 'z' from the Wilcoxon rank-sum test follows the standard normal distribution well: z ~ Normal(0, 1), and as you wrote, sigma is just 1. If we had many imputed datasets (say 100), I have thought of running PROC UNIVARIATE to see whether 'z' follows a standard normal.
Of course you can conduct a normality test to see whether the z-statistics of your samples follow a normal distribution, but I don't think it is necessary.
@JanetXu wrote:
5) I have tried this method: putting 'z' with a standard error of 1 into PROC MIANALYZE on my data. Below is where I cannot completely agree with you.
In the output, the estimate of 'z' is, as everyone knows, just the simple arithmetic mean. There is a 't for H0: Parameter = Theta0'; under it, the value is fairly close to the estimate of 'z', and there is a p-value, Pr > |t|. So I sense that this p-value assumes the average of 'z' follows a non-central t distribution with non-centrality parameter Theta0 under H0? If my understanding is correct, then I doubt this p-value is the 'pooled' p-value we want, because what we want is a 'best' z that follows a normal distribution, and our pooled p-value should come from that 'best' z via the normal distribution directly. I would think simply using the average 'z' to get a p-value from the normal distribution is a reasonable solution.
You have noticed something I also noticed when I first delved into the field of multiple imputation. It is common in the missing-data field that the distribution of the parameters to be pooled differs from that of the pooled parameter. Consider combining the regression coefficients of a logistic regression: each coefficient to be combined follows an asymptotically normal distribution, yet it is a t-test that ultimately decides whether the pooled population coefficient is 0, since the pooled coefficient follows a t rather than a normal distribution. You can safely conclude that all of the pooled parameters in PROC MIANALYZE follow a t distribution, regardless of the distribution of the original parameters being pooled.
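The regression-coefficient case can be sketched the same way in Python (coefficients and standard errors are made up). Unlike the z-statistic case, the within-imputation variance is no longer fixed at 1 but is the average squared standard error, and the pooled estimate is referred to a t distribution.

```python
# Rubin's rules for a regression coefficient: each imputed dataset
# supplies an estimate and its standard error (hypothetical numbers).
import math

coefs = [0.52, 0.48, 0.60, 0.55, 0.45]
ses   = [0.20, 0.22, 0.19, 0.21, 0.20]
m = len(coefs)

q_bar = sum(coefs) / m                                   # pooled coefficient
u_bar = sum(se ** 2 for se in ses) / m                   # within variance
b = sum((q - q_bar) ** 2 for q in coefs) / (m - 1)       # between variance
t_var = u_bar + (1 + 1 / m) * b                          # total variance
t_stat = q_bar / math.sqrt(t_var)                        # compared to t, not normal
df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2      # Rubin's df
print(round(q_bar, 3), round(t_stat, 3), round(df, 1))
```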
But please note that PROC MIANALYZE is not universal in dealing with missing data, so it is not true that pooled parameters across the entire field of MI all follow a t distribution. The D2 method I mentioned is an example: the parameters to be pooled follow a chi-square distribution, yet the pooled parameter follows an F distribution.
You may find the change of distribution in the course of pooling odd (that is what I thought when I was learning it), but that is indeed the case.
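For reference, here is a Python sketch of the D2 combining rule (as I understand it from Li, Meng, Raghunathan and Rubin's work): m chi-square statistics with k degrees of freedom each are pooled into one F-statistic. The chi-square values below are made up.

```python
# D2 pooling of chi-square statistics (hypothetical values).
import math

chi2 = [6.1, 7.4, 5.8, 6.9, 6.5]   # one chi-square statistic per dataset
k = 2                               # df of each chi-square test
m = len(chi2)

d_bar = sum(chi2) / m
sqrt_d = [math.sqrt(d) for d in chi2]
s_mean = sum(sqrt_d) / m
# relative increase in variance: sample variance of the square roots,
# inflated by (1 + 1/m)
r = (1 + 1 / m) * sum((s - s_mean) ** 2 for s in sqrt_d) / (m - 1)

d2 = (d_bar / k - (m + 1) / (m - 1) * r) / (1 + r)   # pooled F-statistic
df2 = k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2     # denominator df
print(round(d2, 3), round(df2, 1))                    # numerator df is k
```

The pooled D2 is then compared against an F(k, df2) distribution, even though each input statistic was chi-square.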
@JanetXu wrote:
6) From #5 above, it goes back to my initial thinking in my question. I am trying to get a 'pooled' statistic (later I thought 'z' can be used directly, same as your thought): a 'pooled sum of scores' from each dataset, a new expected sum of scores, a pooled standard deviation under H0, etc. The idea is not mature.
I don't think the sums of ranks (I don't quite understand what the word "score" in the phrase "pooled sum of score" refers to) can be pooled directly, because they don't follow an asymptotic standard normal distribution. Rather, the z-statistic, a transformed sum of ranks, does follow a standard normal distribution given a reasonable sample size.
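The transformation in question is just a standardization of the rank sum. A minimal Python sketch with toy data (using the no-ties formulas for the mean and variance of the rank sum under H0):

```python
# Standardizing the Wilcoxon rank sum W into an asymptotically N(0, 1)
# z-statistic (no-ties formulas; toy data).
import math

group1 = [1.2, 3.4, 5.6, 7.8]
group2 = [2.1, 4.3, 6.5, 8.7, 9.9]
n1, n2 = len(group1), len(group2)

pooled = sorted(group1 + group2)
ranks = {v: i + 1 for i, v in enumerate(pooled)}   # ranks 1..n1+n2

w = sum(ranks[v] for v in group1)                  # rank sum of group 1
e_w = n1 * (n1 + n2 + 1) / 2                       # E[W] under H0
var_w = n1 * n2 * (n1 + n2 + 1) / 12               # Var[W] under H0
z = (w - e_w) / math.sqrt(var_w)
print(w, round(z, 4))
```

It is this z, not the raw W, whose pooled average can sensibly be treated as (approximately) standard normal, since W's mean and variance depend on the group sizes.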