Hello,
Let me start by saying I'm not a programmer by training, just a biologist doing their best. I learned SAS using the "SAS for R users" modules but I'm still pretty clumsy. I am trying to fix a database error, and know how to do this in R, but I'm working with a very large dataset and need to find either a more efficient method in R or better yet, a way to do it in SAS.
The dataset I'm working with has entries of highest completed education (ed_hc) per per person per year and the date they completed that education (date_ed_hc). Most people look like this, where ed_hc is a code for the specific educational program:
ID | year | ed_hc | date_ed_hc 1 | 2001 | 392 | 12-04-1997 1 | 2002 | 392 | 12-04-1997 1 | 2003 | 412 | 23-06-2003
However, for some people whose records appear to have been updated but not back-corrected, it looks like so:
ID | year | ed_hc | date_ed_hc 2 | 2001 | 373 | 01-01-1970 (should be the same as 2002+2003 since the education was completed in 1977) 2 | 2002 | 465 | 23-07-1977
2 | 2003 | 465 | 23-07-1977
A third set don't seem to have Dec.31 as the reference date for the year (which they should):
ID | year | ed_hc | date_ed_hc 3 | 2001 | 408 | 01-08-1999 3 | 2002 | 408 | 01-08-1999 (should be the same as 2003 since the education was completed in 2002)) 3 | 2003 | 420 | 20-11-2002
I wrote a solution in R that solves these problems nicely, but takes way too long and uses a lot of memory. Would anyone here be able to suggest a direction to go in with SAS?
R solution:
(added empty columns ed_hc_fix and date_ed_hc_fix in SAS, sorted on ID and descending year)
edu$year_ed_hc <-year(edu$date_ed_hc)
IDs <-unique(edu$ID)
for (i in 1:length(IDs)){
q <-edu[edu$ID==IDs[i],]
r <-c(unique(q$year_ed_hc),1950) #1950 is earlier than the earliest date in the dataset
q[q$year>=r[1],"ed_hc_fix"] <-max(q[q$year_ed_hc==r[1],"ed_hc"])
q[q$year>=r[1],"date_ed_hc_fix"] <-min(q[q$year_ed_hc==r[1],"date_ed_hc"])
if (length(r)>2){
for (j in 2:(length(r)-1)){
q[(r[j-1]>q$year & q$year>=r[j]),"ed_hc_fix"] <-max(q[q$year_ed_hc==r[j],"ed_hc"])
q[(r[j-1]>q$year & q$year>=r[j]),"date_ed_hc_fix"] <-min(q[q$year_ed_hc==r[j],"date_ed_hc"])
}
}
edu[edu$ID==ID[i,] <-q
}
What is the output you want for that input? What does the comment about Dec31st mean? There are NO dates in December in your example data.
Let me see if I can restate the problem:
Someone said in 2001 that their highest degree as a BA. Then in 2003 they said they had received an MA in 1997. So you want to assume that the 2003 record is correct and replace the 2001 record? Instead of assuming the date in the 2003 record is wrong?
For the 3rd example/the issue with December, when we say year=2002 it should be the value as of the END of that year (Dec 31). So if someone completed an education in mid-2002, the entry for that person in 2002 should already reflect it.
With regards to the 2nd example, many people go from having an education code with a seeming 'default' date to a different code, but the dates do not align- so in 2002 someone updates from having e.g. BSc, 1970 to a MSc, 1977. Without getting into all the details, it seems like the updated value is the one to trust from 1977-present but it is applied from 2002-present instead.
In both cases the years don't match up and the records need to be back-corrected.
Would a rough logic be, that you take the latest year of highest education as long as that year is less than the year of the education? I'm fairly certain that isn't super efficient R code as well...tidyverse seems like it could be a bit more efficient using a similar group_by and lag logic.
data have;
infile cards dlm='|';
informat ID $2. year 8. ed_hc $8. date_ed_hc ddmmyy10.;
format date_ed_hc ddmmyy10.;
input ID $ year ed_hc $ date_ed_hc;
cards;
1 | 2001 | 392 | 12-04-1997
1 | 2002 | 392 | 12-04-1997
1 | 2003 | 412 | 23-06-2003
2 | 2001 | 373 | 01-01-1970
2 | 2002 | 465 | 23-07-1977
2 | 2003 | 465 | 23-07-1977
3 | 2001 | 408 | 01-08-1999
3 | 2002 | 408 | 01-08-1999
3 | 2003 | 420 | 20-11-2002
;;;;
run;
proc sort data=have;
by id descending year descending date_ed_hc;
run;
data want;
set have;
by ID;
prev_date = lag(date_ed_hc);
if first.id then call missing(prev_date);
if date_ed_hc < prev_date and year(date_ed_hc) <= year then date_ed_hc = prev_date;
format prev_date ddmmyyd10.;;
drop prev_date;
run;
Thanks for the help, I'm going to try this out and report back!
Yes, I'm sure there is a more efficient solution in R as well, I'm no master there either I'm afraid. But even with the most efficient R code I think it'd be a problem, given how slowly it runs basic functions compared to SAS when using large datasets (just loading the file takes twice the time).
Just interleave the education dates and the "years" by ID and DATE and retain the education variables .
data have;
infile cards truncover ;
input id year ed_hc date_ed_hc :ddmmyy. comment $80.;
format date_ed_hc yymmdd10.;
cards;
1 2001 392 12-04-1997
1 2002 392 12-04-1997
1 2003 412 23-06-2003
2 2001 373 01-01-1970 (should be the same as 2002+2003 since the education was completed in 1977)
2 2002 465 23-07-1977
2 2003 465 23-07-1977
3 2001 408 01-08-1999
3 2002 408 01-08-1999 (should be the same as 2003 since the education was completed in 2002))
3 2003 420 20-11-2002
;
proc sql ;
create table dates as
select id, date_ed_hc as date, ed_hc as ed
from have
order by id, date
;
create table years as
select id, year, mdy(12,31,year) as date format=yymmdd10.
from have
order by id, date
;
quit;
data want;
set dates(in=in2) years(in=in1);
by id date;
retain date_ed_hc ed_hc ;
format date_ed_hc yymmdd10.;
if first.id then call missing(date_ed_hc,ed_hc);
if in2 then do; date_ed_hc=date; ed_hc=ed; end;
if in1 ;
keep id year date_ed_hc ed_hc ;
run;
Results:
Obs id year date_ed_hc ed_hc 1 1 2001 1997-04-12 392 2 1 2002 1997-04-12 392 3 1 2003 2003-06-23 412 4 2 2001 1977-07-23 465 5 2 2002 1977-07-23 465 6 2 2003 1977-07-23 465 7 3 2001 1999-08-01 408 8 3 2002 2002-11-20 420 9 3 2003 2002-11-20 420
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.