I am trying to use the SEVERITY procedure to model the associations between a handful of independent variables and a dependent variable that is left-censored at zero. However, I saw in the PROC SEVERITY documentation that the number designated as the censoring limit by the "LC=" option of the LOSS statement must be strictly greater than zero. So what should I do?
An easy workaround is to assign a minuscule positive number to the "LC=" option. But will that affect the precision of the results? Thanks!
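For concreteness, the workaround I have in mind would look roughly like this; the dataset name, the 0.0001 limit, and the list of candidate distributions are placeholders of my own, not a recommendation:

/* y is the dependent variable, left-censored at zero in reality */
proc severity data=work.mydata;
   loss y / lc=0.0001;        /* LC= must be strictly positive, so use a tiny value */
   dist logn gamma weibull;   /* placeholder list of candidate distributions */
run;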
Is your data left censored or left truncated? The way I read the documentation, left truncation means the result is observed only if Y > T, where T is the truncation threshold, while left censoring means it is known only that the magnitude is Y <= C. That may have some effect on the CDF estimates. I suspect that using a small value for the cutpoint may then have a different effect, especially for the candidate distributions that are not defined for Y = 0. I would be tempted to add a small value to all the observations, and then set the cutoff at that value, just to see what happens.
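Something along these lines is what I have in mind; the dataset and variable names and the 0.001 shift are only placeholders:

data work.shifted;
   set work.mydata;
   y_shift = y + 0.001;        /* add a small value to every observation */
run;

proc severity data=work.shifted;
   loss y_shift / lc=0.001;    /* set the cutoff at that same small value */
   dist logn gamma weibull;    /* whatever candidate distributions you are fitting */
run;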
SteveDenham
Hi @Season
For this issue, you might be best served by contacting Technical Support and requesting support for SAS/ETS (the product that includes PROC SEVERITY). Here is a link for your convenience:
https://support.sas.com/en/technical-support.html#contact
Thank you for using SAS!
Thank you for answering my question! But I am not sure whether problems related to modeling, rather than to the software itself, are appropriate to raise with Technical Support; answering them would essentially amount to statistical consultancy. In addition, I have contacted SAS Technical Support before, and I found their responses rather slow. Moreover, because they typically communicate with me by e-mail, my few encounters with them all ended in non-response: they simply stopped replying after a couple of exchanges.
In all, thank you for pointing out this avenue for seeking help, but I will stay here and wait a while for somebody to answer. I have chatted with many knowledgeable and kind friends on this forum.
Thank you for taking your precious time to read the documentation! I really appreciate it! As I said in the original question, my data is censored rather than truncated, in that records with values equal to zero, the threshold, are observed in my dataset. However, there is one point in your reply that I am not sure about and would like to ask further:
@SteveDenham wrote:
Is your data left censored or left truncated? The way I read the documentation, left truncation means the result is observed only if Y > T, where T is the truncation threshold, while left censoring means it is known only that the magnitude is Y <= C. That may have some effect on the CDF estimates. I suspect that using a small value for the cutpoint may then have a different effect, especially for the candidate distributions that are not defined for Y = 0.
I am not sure whether the "different effect" in the final quoted sentence refers to what I intended to ask about. My original question was that I wanted to explicitly define the censoring threshold, namely C, as zero, but the software forced me to use a tiny positive value instead, e.g., C + 0.001. I was unsure whether this would make the parameter estimates incorrect. Now that you mention truncation as well, I am not sure whether you meant that conflating truncation and censoring would lead to incorrect results.
@SteveDenham wrote:
I would be tempted to add a small value to all the observations, and then set the cutoff at that value, just to see what happens.
SteveDenham
Anyway, regardless of the exact point you intended to make, I think your suggestion is a good approach to the original question I raised. I had not thought of trying anything like that before reading your response. Thank you!
So given the definition of left censoring that PROC SEVERITY uses, your response value could potentially be negative. Zero and negative values aren't supported by several of the interesting distributions available to you in SEVERITY. Would those values be meaningful, or even observable? (I only ask because I don't know what the response variable is.) If negative values are not observable, then the left truncation approach has some appeal. You can set the truncation value at a small non-zero number, and all of the estimates are correctly determined. The issue becomes what small value to use. I think a good way to choose would be to see how many decimal places the response is measured to, and then set the truncation limit at half of that resolution. For example, suppose you measure the response to the nearest thousandth (Y.YYY). Under this scheme, a truncation value of 0.0005 would guarantee that the limit is greater than zero and that all observed values are included.
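A rough sketch of that idea, with a placeholder dataset name and distribution list, might look like this:

/* response measured to the nearest thousandth, so truncate at half that resolution */
proc severity data=work.mydata;
   loss y / lt=0.0005;         /* left-truncation limit of 0.0005 */
   dist logn gamma weibull;
run;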
Or am I still missing the point here?
SteveDenham
Thank you for your reply!
@SteveDenham wrote:
So given the definition of left censoring that PROC SEVERITY uses, your response value could potentially be negative. Zero and negative values aren't supported by several of the interesting distributions available to you in SEVERITY. Would those values be meaningful, or even observable? (I only ask because I don't know what the response variable is.)
In fact, PROC SEVERITY adopts a latent variable modeling paradigm for dealing with censored data. For instance, suppose I am the manager of an insurance company and wish to find out what is associated with reimbursement, so I plan to build a linear regression model with the amount of reimbursement (call it "y" here) as the dependent variable and several predictors (call them "x1", "x2", ..., "xn" here) as the independent variables. It is easy to see that y is non-negative, which violates the assumption of multiple regression that the dependent variable can take any real value.
The latent variable modeling paradigm assumes that y is in fact a partially observed version of a latent variable, called y* here. In this example, the amount of reimbursement is assumed to be tied to y* such that if y* ≥ 0, then y = y*; if y* < 0, then y = 0. The variable y* is called latent because it is never fully observed. If the insurance company's records do include cases where y = 0, then the manifest variable y is said to be censored.
However, if the insurance company's database contains only records with reimbursement, that is, subjects without reimbursement are not documented at all, then the dataset contains only cases with y > 0 (no equal sign here). In this case, the manifest variable is called truncated. Of note, the latent variable modeling paradigm can be readily applied to the case of truncation; in fact, the same collection of statistical tools built on the latent variable assumption can handle both censoring and truncation. However, researchers must be fully aware of whether their data are truncated or censored so as to report and interpret the results correctly, as truncation and censoring are essentially different concepts.
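To make the distinction concrete, here is a toy DATA step simulation of the two situations; the normal distribution and its parameters are arbitrary choices, used only for illustration:

data work.censored work.truncated;
   call streaminit(12345);
   do i = 1 to 1000;
      y_star = rand('normal', 1, 2);        /* latent variable: any real value */
      y = max(y_star, 0);                   /* censoring: zeros are recorded as zeros */
      output work.censored;                 /* censored file keeps every subject */
      if y > 0 then output work.truncated;  /* truncated file never records the zeros */
   end;
   drop i;
run;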
Before I end this thread, I would like to point out that the latent variable modeling paradigm is not the only approach to modeling censored or truncated data. Other modeling paradigms include the two-part model approach and the hurdle model approach; they are, however, not supported by PROC SEVERITY. For a concise yet comprehensive, and therefore excellent, review, see "Two-Part Models for Zero-Modified Count and Semicontinuous Data" on SpringerLink.
Now back to my question.
@SteveDenham wrote:
So given the definition of left censoring that PROC SEVERITY uses, your response value could potentially be negative. Zero and negative values aren't supported by several of the interesting distributions available to you in SEVERITY. Would those values be meaningful, or even observable? (I only ask because I don't know what the response variable is.) If negative values are not observable, then the left truncation approach has some appeal.
In fact, my data is censored at zero in that it contains cases with y = 0. However, my data resembles insurance reimbursement data in that negative values are meaningless and can never be observed; they exist only for the imaginary latent variable.
To summarize, my data contains cases at the censoring threshold; they are not truncated. I know that the same collection of statistical tools (e.g., Tobit models) can be applied to such data regardless of whether they are censored or truncated, but I beg to point out that I am still skeptical about whether arbitrarily designating censored data as truncated would lead to correct results.
@SteveDenham wrote:
You can set the truncation value at a small non-zero number, and all of the estimates are correctly determined. The issue becomes what small value to use. I think a good way to choose would be to see how many decimal places the response is measured to, and then set the truncation limit at half of that resolution. For example, suppose you measure the response to the nearest thousandth (Y.YYY). Under this scheme, a truncation value of 0.0005 would guarantee that the limit is greater than zero and that all observed values are included.
Or am I still missing the point here?
SteveDenham
I think your idea of tentatively selecting several thresholds and seeing what happens is a very nice one. Although the scheme you proposed was built around selecting truncation thresholds, the same attempts carry over easily to the selection of censoring thresholds. Therefore, I tried your approach on my data.
Before I disclose my findings, I would like to reiterate that my original objective was to model the relationship between y and x1, x2, ..., xn. However, the SEVERITY procedure is versatile and can perform multiple tasks. The more basic one is to estimate the parameters of the distribution(s) that y follows. A more advanced one is to build a regression model for the scale parameter of the distribution(s) of y, e.g., the parameter μ if y follows a lognormal distribution. The latter can be done by adding the SCALEMODEL statement to the SEVERITY procedure.
In line with these capabilities, my efforts to implement your idea went in two directions: (1) estimating the parameters of y in the absence of predictors; and (2) estimating both the non-scale parameters of y and the regression coefficients of the model for the scale parameter, as sketched below. To accomplish both goals, I tried several minuscule yet positive thresholds, all of course smaller than the smallest observed positive value in my dataset.
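For reference, the two kinds of runs looked roughly like the following; the 0.0001 threshold, the dataset name, the regressor list, and the distribution list are placeholders, and I repeated both runs with several different LC= values:

/* (1) Distribution fitting in the absence of predictors */
proc severity data=work.mydata;
   loss y / lc=0.0001;
   dist logn gamma weibull;
run;

/* (2) The same fit, with a regression model for the scale parameter */
proc severity data=work.mydata;
   loss y / lc=0.0001;
   scalemodel x1 x2 x3;        /* regressors for the scale parameter */
   dist logn gamma weibull;
run;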
However, it was disturbing to find that different thresholds did lead to different results. For the first objective, PROC SEVERITY still exhibited some consistency, at least in the estimates of several (but not all!) of the distributions built into the procedure. For the second objective (i.e., in the presence of predictors), the regression coefficient estimates of the scale-parameter model deviated from one another to varying degrees, and quite wildly on some occasions.
Therefore, my conclusion is that PROC SEVERITY is not a good tool for dealing with zero-censored data, as the results depend on the specification of the censoring threshold. A caveat to this conclusion is that PROC SEVERITY supports several advanced features related to parameter estimation, including the specification of starting values for the maximum likelihood estimation process, the underlying method the procedure uses for all of the tasks mentioned above. I am not sure whether careful use of these utilities could remedy the problems described in the preceding paragraph, but I have no interest in trying it out.