I have about 100,000 observations which are divided into groups of 3-10 obervations. Each obervation has a year variable. I need to delete all observations which do not have a corresponding observation in the same group whose year is five less.
The approach I want to take it to create a set for each group that includes all of the years in that group. Then I would delete those observations for whom their year is not five greater than some year in that set. I just don't know how to create and work with such a "set."
Sort the file based on your "group" identifier, then using a DATA step, define a BY statement for your "group" variable, and use IF / THEN construct(s) with FIRST. and/or LAST. to "RETAIN" some desired "year" variable value to test against each observation. Only output those observations that meet your year filter criteria.
Recommended Google advanced search argument for this topic/post:
Developing Scott's idea for a data step, you have an extra challenge, since the range of "year" in a "group" would be determined by a pass through all rows for a "group". Only after that can you filter the groups.
This kind of "double pass" is most often done in sql. Here is a data step approach
imagine your data are in the "group" order[pre]
data reduced ;
set original( in=checking ) original(in= filtering ) ;
by group ;
if first.group then do ;
if checking then do ;
* collect spread info ;
y1 = min( y1, year) ;
y2 = max( y2, year) ;
if filtering ;[/pre] * first time through is only for checking the min/max year in "group" now we do some "output-ing" ;[pre] if (y2-y1) GT 5 then output ;
retain y1 y2 ;
drop y1 y2 ;
Thanks everyone for the help. Unfortunately, I am not advanced enough to follow your arguments. Can someone recommend a good book? I need to understand the processing flow that SAS uses as well as a bunch of functions.