I'm working with a large dataset (5,000+ subjects) that were measured anywhere between one and nine times. I'm using repeated measures in proc mixed to analyze my data. My variables are subject, year measured, and size. I have a lot of subjects, however, that were only measured in one year. Does anyone know how to remove subjects that only appear once in the dataset?
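Another option to consider is PROC SORT with the NODUPKEY option and the DUPOUT= option.
With BY subject, PROC SORT keeps the first observation for each subject in the OUT= data set and writes every additional observation for that subject to the DUPOUT= data set. Any subject that appears in the DUPOUT= data set was therefore measured more than once, and you can use that list to subset your original data. Note that NODUPKEY compares only the BY statement variables; it is slightly different from NODUPS (NODUPRECS), which compares all observation variables but looks for duplicates only among "adjacent" observations.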
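A minimal sketch of this approach, assuming the data are in WORK.PLANTS with one row per subject per year and variables named SUBJECT, YEAR, and SIZE (the data set and variable names are guesses based on the question, not taken from the poster's actual data):

```sas
/* Assumed input: WORK.PLANTS with variables SUBJECT, YEAR, SIZE. */
/* All data set names below are hypothetical.                     */

/* Keep the first row per subject in FIRST_OBS; every additional  */
/* row for a subject is diverted to REPEAT_OBS.                   */
proc sort data=plants out=first_obs nodupkey dupout=repeat_obs;
  by subject;
run;

/* Reduce REPEAT_OBS to one row per subject: these are the        */
/* subjects that were measured more than once.                    */
proc sort data=repeat_obs(keep=subject) out=multi_ids nodupkey;
  by subject;
run;

/* Sort the full data so it can be merged by SUBJECT.             */
proc sort data=plants out=plants_sorted;
  by subject year;
run;

/* Keep only rows whose subject appears in MULTI_IDS.             */
data plants_multi;
  merge plants_sorted multi_ids(in=in_multi);
  by subject;
  if in_multi;
run;
```

PLANTS_MULTI then holds all measurements for subjects with two or more rows. If a subject can have more than one row within the same year, this counts rows rather than distinct years, so the BY variables would need adjusting.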
Scott Barry
SBBWorks, Inc.
The methods that have been mentioned will do what you have requested. But why do you want to remove the subjects who appear only once?
It is not necessary to do so for the purposes of estimation of model parameters. It might be necessary if you believe that the missing observations for those individuals are not missing at random. However, if you believe that the missingness is unrelated to the response, then you would actually be better off leaving these individuals in your analysis.
Thanks for your help. I wanted to remove all of the single observations because most of them are not random, but represent plants that only lived for one year and so were not measured more than once.
Hmm, I don't know that your reasoning is valid. You are censoring plants based on some quality of their response. Thus, you do not have a situation in which the response is missing at random. I would advise against removal of the observations which have only one response. It might be OK to do a sensitivity analysis in which you look at results with and without the plants that lived for only one year. But I think your primary analysis should include all plants.
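The sensitivity analysis could be sketched roughly as follows. This assumes the full data are in WORK.PLANTS and the multi-year subset is in WORK.PLANTS_MULTI (both names hypothetical), with variables SUBJECT, YEAR, and SIZE. The fixed effect for year and the compound-symmetry covariance structure are placeholders only; substitute whatever model you are actually fitting.

```sas
/* Primary analysis: all plants, including those measured once.   */
/* TYPE=CS is an assumed covariance structure for illustration.   */
proc mixed data=plants;
  class subject year;
  model size = year / solution;
  repeated year / subject=subject type=cs;
  ods output SolutionF=est_all;
run;

/* Sensitivity analysis: only plants measured in 2+ years.        */
proc mixed data=plants_multi;
  class subject year;
  model size = year / solution;
  repeated year / subject=subject type=cs;
  ods output SolutionF=est_multi;
run;

/* Stack the fixed-effect estimates so they can be compared.      */
data compare;
  length analysis $ 12;
  set est_all(in=from_all) est_multi;
  analysis = ifc(from_all, 'all plants', 'multi-year');
run;

proc print data=compare;
  var analysis effect year estimate stderr;
run;
```

If the estimates and standard errors differ materially between the two fits, that suggests the single-year plants are influencing the results and the reason they drop out deserves a closer look.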