Hello, Let me start by saying I'm not a programmer by training, just a biologist doing their best. I learned SAS using the "SAS for R users" modules but I'm still pretty clumsy. I am trying to fix a database error, and know how to do this in R, but I'm working with a very large dataset and need to find either a more efficient method in R or better yet, a way to do it in SAS. The dataset I'm working with has entries of highest completed education (ed_hc) per per person per year and the date they completed that education (date_ed_hc). Most people look like this, where ed_hc is a code for the specific educational program: ID | year | ed_hc | date_ed_hc
1 | 2001 | 392 | 12-04-1997
1 | 2002 | 392 | 12-04-1997
1 | 2003 | 412 | 23-06-2003 However, for some people whose records appear to have been updated but not back-corrected, it looks like so: ID | year | ed_hc | date_ed_hc
2 | 2001 | 373 | 01-01-1970 (should be the same as 2002+2003 since the education was completed in 1977)
2 | 2002 | 465 | 23-07-1977 2 | 2003 | 465 | 23-07-1977 A third set don't seem to have Dec.31 as the reference date for the year (which they should): ID | year | ed_hc | date_ed_hc
3 | 2001 | 408 | 01-08-1999
3 | 2002 | 408 | 01-08-1999 (should be the same as 2003 since the education was completed in 2002))
3 | 2003 | 420 | 20-11-2002 I wrote a solution in R that solves these problems nicely, but takes way too long and uses a lot of memory. Would anyone here be able to suggest a direction to go in with SAS? R solution: (added empty columns ed_hc_fix and date_ed_hc_fix in SAS, sorted on ID and descending year) edu$year_ed_hc <-year(edu$date_ed_hc) IDs <-unique(edu$ID) for (i in 1:length(IDs)){ q <-edu[edu$ID==IDs[i],] r <-c(unique(q$year_ed_hc),1950) #1950 is earlier than the earliest date in the dataset q[q$year>=r[1],"ed_hc_fix"] <-max(q[q$year_ed_hc==r[1],"ed_hc"]) q[q$year>=r[1],"date_ed_hc_fix"] <-min(q[q$year_ed_hc==r[1],"date_ed_hc"]) if (length(r)>2){ for (j in 2:(length(r)-1)){ q[(r[j-1]>q$year & q$year>=r[j]),"ed_hc_fix"] <-max(q[q$year_ed_hc==r[j],"ed_hc"]) q[(r[j-1]>q$year & q$year>=r[j]),"date_ed_hc_fix"] <-min(q[q$year_ed_hc==r[j],"date_ed_hc"]) } } edu[edu$ID==ID[i,] <-q }
... View more