igforek
Quartz | Level 8

Hello,
I need some advice about the survival analysis I am performing for my research data. I am using PHREG. I have two groups, the one that received the treatment (in) and the control (UN); the variable is named "InfStaStr". The time variable is longevity, named "Longev". I also have fecundity (named "TotFec"), which is the number of eggs laid by the individuals in the experiment during their lifetime. Another covariate is "Size". I am testing whether the treatment ("InfSta"), "TotFec" and "Size", or their interactions, have an effect on "Longev".
I used the full data set to run a stepwise PHREG and was left with "InfSta", "TotFec" and their interaction "InfSta*TotFec" in the model. The interaction is not significant (see the line marked yellow in the "FULL DATA SET" tables of the attachment); I think I will remove this term from the final model.
When testing proportionality with the full model, "TotFec" seems not to be proportional, as shown by the pink highlighted p-values in the table "Testing proportionality by time- dependence FULL DATA SET" of the attachment.
I checked my data table and found that two early Longev points (IDs 4 and 58) from the control group (UN) seemed to be the main cause of the problem. So I removed the first three early Longev data points, IDs 10, 4 and 58 (see the portion of my data table under "EARLY DATA POINTS THAT ARE PROBLEMATIC" in the attachment).
I reran the stepwise analysis with the reduced data set, and this time I was left with "InfSta", "TotFec" and "InfSta*TotFec" in the model, and ALL OF THEM were significant (see the tables for "3 observations removed" in the attachment). When testing for proportionality, "TotFec" is now proportional (see the green highlighted p-values in "Testing proportionality by time- dependence 3 observations removed" of the attachment).

 

MY CODE IS SHOWN IN THE ATTACHMENT.

 

My question is:
Is it justified to remove the first three observations to solve the non-proportionality problem observed for "TotFec" in the full data set analysis?

 

The full data set has a total of 74 observations (27 UN, 47 in). After removing three, I have 25 UN and 46 in.

PGStats
Opal | Level 21

Unless there is something special about those observations that invalidates them (other than the fact you described), removing them would be making the data fit the model. Statistical modelling is supposed to go the other way around.

PG
igforek
Quartz | Level 8

Is that why you are a super user? You promptly respond with made-up phrases?

 

PGStats
Opal | Level 21

I'm sorry I couldn't provide an answer that's acceptable to you. I'll be watching this thread for deeper insight.

PG
igforek
Quartz | Level 8

Please, do not tire yourself.

Doc_Duke
Rhodochrosite | Level 12

PG's right.  Just removing data to make the model fit is a form of "data dredging" and can lead to erroneous conclusions.  There have been several books written describing how research has gotten into trouble that way.

 

Those three observations may be the most important in your data set.  Identifying what made them outliers may be much more important than the analysis of the "well behaved" data.

igforek
Quartz | Level 8

Thank you for your answer. Here is the situation. When I use SAS, as explained in my post (and the attachment to my post), the variable TotFec IS NOT proportional. When I run the analysis in R, TotFec is proportional, as seen below:

 

> print(cox.zph(CoxModel.15))
                     rho     chisq     p
TotFec          0.056822  1.58e-01 0.691
InfStaStr[T.UN] 0.000708  3.49e-05 0.995
GLOBAL                NA  1.71e-01 0.918

 

I also tested the factor by itself and, again, it seems to be proportional.

 

> cox.zph(fit, transform = 'identity')
         rho chisq     p
TotFec 0.14  0.924 0.336
> cox.zph(fit, transform = 'log')
         rho chisq     p
TotFec 0.181 1.55  0.214

 

As shown above, my variable is proportional in all tests performed in R. Additionally, in the analysis with SAS, only the inclusion of TotFec*LOG(Time) in the Cox model produces non-proportionality for this variable. Another test, using the ASSESS statement with the PH option, shows that TotFec is proportional (supremum test, p=0.0930).
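For readers following along, the TotFec*LOG(Time) check mentioned here can be written out as a model. This is only a sketch: the symbols $\beta_1$, $\beta_2$, $\gamma$ and the 0/1 coding of the treatment are notation assumed for illustration, not taken from the attachment, and the InfSta*TotFec interaction is left out for brevity.

```latex
% Assumed notation: x_1 = treatment indicator (InfStaStr), x_2 = TotFec, t = Longev
h(t \mid x_1, x_2) = h_0(t)\,\exp\bigl(\beta_1 x_1 + \beta_2 x_2 + \gamma\, x_2 \log t\bigr)

% Proportional hazards for TotFec corresponds to the null hypothesis
H_0 : \gamma = 0

% Under this model, the log hazard ratio per unit of TotFec at time t is
\beta_2 + \gamma \log t
```

A significant $\gamma$ therefore says only that the TotFec effect appears to drift with $\log t$; other functions of time may or may not show the same drift.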

 

A book on the Cox model published in 2000 by Therneau and Grambsch, "Modeling Survival Data: Extending the Cox Model" (pages 144-145), refers to a question from a user who found a somewhat similar situation when analyzing his data (SAS shows a variable to be non-proportional, but R does not). The authors show that when testing in R, the "identity" transformation produces p=0.69 (proportional) and the log transformation produces p=0.00694 (non-proportional). They point out that, after examining the data and the transformation graphs, it is clear that the non-proportionality was caused by ONLY two early data points that were outliers in a log plot (hence the log test has p=0.00694, as shown above). They argue that the non-proportionality for the variable in question should be ignored.
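The contrast between the identity and log transformations can be sketched as follows, assuming the usual Grambsch-Therneau formulation behind cox.zph, in which the coefficient is allowed to vary with a chosen function $g$ of time. The symbols $\theta$ and $g$ are notation introduced here, and the times 1, 2, 50 and 51 are purely illustrative, not values from any data set in this thread.

```latex
% Time-varying coefficient tested by the transform-based check
\beta(t) = \beta + \theta\, g(t), \qquad H_0 : \theta = 0

% Two of the available choices of g
g(t) = t \quad (\text{identity}), \qquad g(t) = \log t \quad (\text{log})

% The log scale stretches early times and compresses late ones
\log 2 - \log 1 \approx 0.69, \qquad \log 51 - \log 50 \approx 0.02
```

On the log scale the earliest event times sit far from the rest, so a couple of early observations can carry most of the evidence against $\theta = 0$, while on the identity scale the same points are unremarkable.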

 

I was using an analogous rationale for my data set. The points that seem to be causing the non-proportionality are "early" data points.

 

On a separate point, it must be really scary living in a world where everyone who asks a question like this is a "data dredger" (paraphrasing your words). Is it not more logical to assume that someone who asks this question is sincerely looking for advice? How is it logical to assume that someone who wants to cheat in his data analysis would bring this question to light in a public forum? Perhaps I am naive to think that this forum is a place to candidly address doubts about data analysis.

 

Anyway, thank you for your answer. If you feel that you do not want to reply, I understand. Have a good day.

 

 

SteveDenham
Jade | Level 19

In five years following this community, and several years following SAS-L before that, I have never seen a reply to a helpful response as rude as the one supplied by @igforek. Following the "logic" presented, why not just throw out as many data points as necessary to fit your theory? You could soon be fitting epicycles and deferents perfectly...

 

Steve Denham

igforek
Quartz | Level 8

Thank you, SAS users, for your input. I just wish you had paid this much attention to other questions I posted in the past. I have several posts that nobody even cared to comment on.

 

I understand that no one is under the obligation to answer any of my questions. But, at least if you want to provide an answer, I expect it to be a respectful one. Respect goes both ways.

 

I want to thank a couple of you that did pay attention to my questions in the past and gave answers that were helpful.

 

It was not my intention to offend anyone with my answer. But I see that now some of you feel the need to continue providing sarcastic observations on my question. Please, do not answer my post if you are going to be sarcastic. Show some respect to the person who posted the question too. You cannot judge anyone if you do not have full knowledge of the facts. Probably some of you did not even see the attachment I provided with my question.

 

Again, please, do not waste your time to post sarcasm. You also waste my time if you do.
