Re: for loop from R to SAS

elolvido · Posted 04-20-2021 09:40 AM

Hello,

Let me start by saying I'm not a programmer by training, just a biologist doing their best. I learned SAS using the "SAS for R users" modules but I'm still pretty clumsy. I am trying to fix a database error, and know how to do this in R, but I'm working with a very large dataset and need to find either a more efficient method in R or better yet, a way to do it in SAS.

The dataset I'm working with has entries of highest completed education (ed_hc) per per person per year and the date they completed that education (date_ed_hc). Most people look like this, where ed_hc is a code for the specific educational program:

ID | year | ed_hc | date_ed_hc
1  | 2001 | 392   | 12-04-1997
1  | 2002 | 392   | 12-04-1997
1  | 2003 | 412   | 23-06-2003

However, for some people whose records appear to have been updated but not back-corrected, it looks like so:

ID | year | ed_hc | date_ed_hc
2  | 2001 | 373   | 01-01-1970 (should be the same as 2002+2003 since the education was completed in 1977)
2  | 2002 | 465   | 23-07-1977
2  | 2003 | 465   | 23-07-1977

A third set don't seem to have Dec.31 as the reference date for the year (which they should):

ID | year | ed_hc | date_ed_hc
3  | 2001 | 408   | 01-08-1999
3  | 2002 | 408   | 01-08-1999 (should be the same as 2003 since the education was completed in 2002))
3  | 2003 | 420   | 20-11-2002

I wrote a solution in R that solves these problems nicely, but takes way too long and uses a lot of memory. Would anyone here be able to suggest a direction to go in with SAS?

R solution:

(added empty columns ed_hc_fix and date_ed_hc_fix in SAS, sorted on ID and descending year)

edu$year_ed_hc <-year(edu$date_ed_hc)
IDs <-unique(edu$ID)
for (i in 1:length(IDs)){
  q <-edu[edu$ID==IDs[i],]
  r <-c(unique(q$year_ed_hc),1950) #1950 is earlier than the earliest date in the dataset
  q[q$year>=r[1],"ed_hc_fix"] <-max(q[q$year_ed_hc==r[1],"ed_hc"])
  q[q$year>=r[1],"date_ed_hc_fix"] <-min(q[q$year_ed_hc==r[1],"date_ed_hc"])
  if (length(r)>2){
    for (j in 2:(length(r)-1)){
      q[(r[j-1]>q$year & q$year>=r[j]),"ed_hc_fix"] <-max(q[q$year_ed_hc==r[j],"ed_hc"])
      q[(r[j-1]>q$year & q$year>=r[j]),"date_ed_hc_fix"] <-min(q[q$year_ed_hc==r[j],"date_ed_hc"])
    }
  }
  edu[edu$ID==ID[i,] <-q
}

Tom · Posted 04-20-2021 10:16 AM

What is the output you want for that input? What does the comment about Dec31st mean? There are NO dates in December in your example data.

Let me see if I can restate the problem:

Someone said in 2001 that their highest degree as a BA. Then in 2003 they said they had received an MA in 1997. So you want to assume that the 2003 record is correct and replace the 2001 record? Instead of assuming the date in the 2003 record is wrong?

elolvido · Posted 04-20-2021 12:08 PM

For the 3rd example/the issue with December, when we say year=2002 it should be the value as of the END of that year (Dec 31). So if someone completed an education in mid-2002, the entry for that person in 2002 should already reflect it.

With regards to the 2nd example, many people go from having an education code with a seeming 'default' date to a different code, but the dates do not align- so in 2002 someone updates from having e.g. BSc, 1970 to a MSc, 1977. Without getting into all the details, it seems like the updated value is the one to trust from 1977-present but it is applied from 2002-present instead.
In both cases the years don't match up and the records need to be back-corrected.

Reeza · Posted 04-20-2021 10:17 AM

Would a rough logic be, that you take the latest year of highest education as long as that year is less than the year of the education? I'm fairly certain that isn't super efficient R code as well...tidyverse seems like it could be a bit more efficient using a similar group_by and lag logic.

data have;
infile cards dlm='|';
informat ID $2. year 8. ed_hc $8. date_ed_hc ddmmyy10.;
format date_ed_hc ddmmyy10.;
input ID $ year  ed_hc $ date_ed_hc;
cards;
1  | 2001 | 392   | 12-04-1997
1  | 2002 | 392   | 12-04-1997
1  | 2003 | 412   | 23-06-2003
2  | 2001 | 373   | 01-01-1970 
2  | 2002 | 465   | 23-07-1977
2  | 2003 | 465   | 23-07-1977
3  | 2001 | 408   | 01-08-1999
3  | 2002 | 408   | 01-08-1999 
3  | 2003 | 420   | 20-11-2002
;;;;
run;

proc sort data=have;
by id descending year descending date_ed_hc;
run;

data want;
set have;
by ID;
prev_date = lag(date_ed_hc);
if first.id then call missing(prev_date);
if date_ed_hc < prev_date and year(date_ed_hc) <= year then date_ed_hc = prev_date;
format prev_date  ddmmyyd10.;;
drop prev_date;
run;

elolvido · Posted 04-20-2021 12:16 PM

Thanks for the help, I'm going to try this out and report back!

Yes, I'm sure there is a more efficient solution in R as well, I'm no master there either I'm afraid. But even with the most efficient R code I think it'd be a problem, given how slowly it runs basic functions compared to SAS when using large datasets (just loading the file takes twice the time).

Tom · Posted 04-20-2021 10:35 AM

Just interleave the education dates and the "years" by ID and DATE and retain the education variables .

data have;
  infile cards truncover ;
  input id year ed_hc date_ed_hc :ddmmyy. comment $80.;
  format date_ed_hc yymmdd10.;
cards;
1 2001 392 12-04-1997
1 2002 392 12-04-1997
1 2003 412 23-06-2003
2 2001 373 01-01-1970 (should be the same as 2002+2003 since the education was completed in 1977)
2 2002 465 23-07-1977
2 2003 465 23-07-1977
3 2001 408 01-08-1999
3 2002 408 01-08-1999 (should be the same as 2003 since the education was completed in 2002))
3 2003 420 20-11-2002
;

proc sql ;
create table dates as 
  select id, date_ed_hc as date, ed_hc as ed
  from have
  order by id, date 
;
create table years as 
  select id, year, mdy(12,31,year) as date format=yymmdd10.
  from have
  order by id, date 
;
quit;

data want;
   set  dates(in=in2) years(in=in1);
   by id  date;
   retain date_ed_hc ed_hc ;
   format date_ed_hc yymmdd10.;
   if first.id then call missing(date_ed_hc,ed_hc);
   if in2 then do; date_ed_hc=date; ed_hc=ed; end;
   if in1 ;
   keep id year date_ed_hc ed_hc  ;
run;

Results:

Obs    id    year    date_ed_hc    ed_hc

 1      1    2001    1997-04-12     392
 2      1    2002    1997-04-12     392
 3      1    2003    2003-06-23     412
 4      2    2001    1977-07-23     465
 5      2    2002    1977-07-23     465
 6      2    2003    1977-07-23     465
 7      3    2001    1999-08-01     408
 8      3    2002    2002-11-20     420
 9      3    2003    2002-11-20     420

for loop from R to SAS