I have a data file containing educational test results for students in year levels 3, 5, 7, and 9. Each year level also has a value for "assessed year", based on the calendar year in which the student sat the test (values for calendar year vary by id). However I have many missing values for "assessed year". For each id, each value of assessed year should be 2 years apart. This is what the data should look like:
ID | YearLevel | AssessedYear |
1 | 3 | 2007 |
1 | 5 | 2009 |
1 | 7 | 2011 |
1 | 9 | 2013 |
2 | 3 | 2009 |
2 | 5 | 2011 |
2 | 7 | 2013 |
2 | 9 | 2015 |
3 | 3 | 2011 |
3 | 5 | 2013 |
3 | 7 | 2015 |
3 | 9 | 2017 |
4 | 3 | 2003 |
4 | 5 | 2005 |
4 | 7 | 2007 |
4 | 9 | 2009 |
5 | 3 | 2003 |
5 | 5 | 2005 |
5 | 7 | 2007 |
5 | 9 | 2009 |
But this is what I have:
ID | YearLevel | AssessedYear |
1 | 3 | . |
1 | 5 | 2009 |
1 | 7 | . |
1 | 9 | 2013 |
2 | 3 | 2009 |
2 | 5 | 2011 |
2 | 7 | 2013 |
2 | 9 | . |
3 | 3 | 2011 |
3 | 5 | . |
3 | 7 | . |
3 | 9 | . |
4 | 3 | . |
4 | 5 | 2005 |
4 | 7 | 2007 |
4 | 9 | 2009 |
5 | 3 | . |
5 | 5 | . |
5 | 7 | . |
5 | 9 | 2009 |
I want to fill in the missing cells by calculating +/- 2 years from the non-missing values of assessed year.
Any suggestions will be greatly appreciated, this has been keeping me occupied for some time now!
Thanks
Use the fact that AssessedYear - YearLevel is a constant for a given ID :
data have;
input ID YearLevel AssessedYear;
datalines;
1 3 .
1 5 2009
1 7 .
1 9 2013
2 3 2009
2 5 2011
2 7 2013
2 9 .
3 3 2011
3 5 .
3 7 .
3 9 .
4 3 .
4 5 2005
4 7 2007
4 9 2009
5 3 .
5 5 .
5 7 .
5 9 2009
;
data want;
do until(last.ID);
set have; by ID;
if not missing(AssessedYear) then offset = AssessedYear - YearLevel;
end;
do until(last.ID);
set have; by ID;
AssessedYear = coalesce(AssessedYear, sum(YearLevel, offSet));
output;
end;
drop offset;
run;
Use the fact that AssessedYear - YearLevel is a constant for a given ID :
data have;
input ID YearLevel AssessedYear;
datalines;
1 3 .
1 5 2009
1 7 .
1 9 2013
2 3 2009
2 5 2011
2 7 2013
2 9 .
3 3 2011
3 5 .
3 7 .
3 9 .
4 3 .
4 5 2005
4 7 2007
4 9 2009
5 3 .
5 5 .
5 7 .
5 9 2009
;
data want;
do until(last.ID);
set have; by ID;
if not missing(AssessedYear) then offset = AssessedYear - YearLevel;
end;
do until(last.ID);
set have; by ID;
AssessedYear = coalesce(AssessedYear, sum(YearLevel, offSet));
output;
end;
drop offset;
run;
Thank you!!!
That works perfectly and has saved me more hours of hair pulling. Amazing 🙂
I guess I should have refreshed my screen before answering; @PGStats has already solved this. (nice solution)
Since I've already coded it and since it can be helpful to see different ways of doing things, here's another solution:
DATA WORK.Assessments
WORK.Populated_Years (RENAME=(Year_Level=Populated_Level Assessed_Year=Populated_Year))
;
INFILE DATALINES4
DSD DLM = '09'X
FIRSTOBS = 2
MISSOVER;
INPUT ID $
Year_Level $
Assessed_Year : 4.
;
OUTPUT WORK.Assessments;
IF NOT MISSING(Assessed_Year) THEN
OUTPUT WORK.Populated_Years;
ELSE
DELETE;
DATALINES4;
ID Year Level Assessed_Year
1 3 .
1 5 2009
1 7 .
1 9 2013
2 3 2009
2 5 2011
2 7 2013
2 9 .
3 3 2011
3 5 .
3 7 .
3 9 .
4 3 .
4 5 2005
4 7 2007
4 9 2009
5 3 .
5 5 .
5 7 .
5 9 2009
;;;;
RUN;
DATA Fully_Populated_Assessments;
DROP _: Populated_:;
IF 0 THEN
DO;
SET WORK.Populated_Years;
END;
IF _N_ = 1 THEN
DO;
DECLARE HASH Hsh_Pop_Years (DATASET: 'WORK.Populated_Years');
Hsh_Pop_Years.DefineKey ('ID');
Hsh_Pop_Years.DefineData ('Populated_Level', 'Populated_Year');
Hsh_Pop_Years.DefineDone ();
END;
SET WORK.ASSESSMENTS;
IF MISSING(Assessed_Year) THEN
DO;
_RC = Hsh_Pop_Years.FIND();
IF _RC = 0 THEN
DO;
Assessed_Year = Populated_Year + (Year_Level - Populated_Level);
END;
END;
RUN;
Jim
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.