Solved: Lead / lag fonction to calculate Age for Missing Values

MFraga · Posted 09-27-2018 01:26 PM

Hello,

I need to impute missing values for my "age" variable. I have a huge longitudinal dataset (more than ten thousand observations) and every line corresponds to an annually observation. So I need to impute "age" controlling by my "id" variable. Here is an example of the structure of my dataset:

data have;

input id age;

datalines;

1 34

1 35

1 36

1 .

1 40

1 41

1 .

2 .

2 22

2 23

2 24

2 .

2 26

2 .

2 30

3 .

3 50

3 .

run;

I have already sorted the data in a descending way using this program:

proc sort data=have;

by id descending age;

run;

This works fine. My next step is to impute the missing values for my "age" variable. I write this data setp:

data want;

set have;

if age = . and id = lag(id) then age = lag(age) - 1;

run;

Unfortunately, the missing values for "age" remain missing. How could I solve this problem ? Also, for my "id" 3, how could I impute the observations that are above and are also missing ?

What I want as a result for the "age" variable would be:

datalines;

1 34

1 35

1 36

1 37

1 38

1 39

1 40

1 41

1 42

1 43

2 21

2 22

2 23

2 24

2 25

2 26

2 27

2 28

2 29

2 30

3 44

3 45

3 46

3 47

3 48

3 49

3 50

3 51

3 52

3 53

run;

I use SAS 9.3.

Thank you for your help!

novinosrin · Posted 09-27-2018 02:12 PM

Hi @MFraga I have provided a solution. Kindly let me know if that works. Thank you!

View solution in original post

ballardw · Posted 09-27-2018 01:39 PM

@MFraga wrote:

Hello,

I need to impute missing values for my "age" variable. I have a huge longitudinal dataset (more than ten thousand observations) and every line corresponds to an annually observation. So I need to impute "age" controlling by my "id" variable. Here is an example of the structure of my dataset:

First thing, 10,000 observations is not "huge" for SAS. Worry about huge when getting in the 100s of millions or more.

If this is longitudinal data do you have a date for when each record was collected? That would be appropriate to sort by as well to get records in order and might providing additional data that can be used provide a better estimate.

novinosrin · Posted 09-27-2018 01:53 PM

for id 2,

2 .

2 22

2 23

2 24

2 .

2 26

2 .

Since you do not have the first age(i.e being missing) , do you want to the logic to go backwards and forwards?

Oh well, perhaps that's what you meant by lead/lag?

MFraga · Posted 09-27-2018 02:11 PM

Hi novinosrin,

I need to fill all the missing values following the logic idea that we just get old one year every year. Every observation is colected just once every year. I do not want to have missing values for my "age" variable. The way I will sort my data at the end of the imputation will be "ascending" the "age" variable after I have no more missing values. So, for the "id" 2, I need to fill the all the missing values, it does not matter if it is the first or the last observation.

Originally, my data is "ascending" for age so I know that every missing value corresponds to, for instance:

"22 - 1 = 21" to fill the first line for the "id" 2,

or to fill "26 + 1 = 27" for the last line of "id" 2,

or to fill "26 - 1 = 25" or "24 + 1 = 25" for the fifth line of "id" 2.

Thanks for your help!

Astounding · Posted 09-27-2018 02:23 PM

You may need to run through the data twice, because of the possibility of missing values at either the beginning or end of the series.

You also will need to set up an arbitrary variable as a counter that you can use for sorting. Here's an example that assumes your data set is originally in the proper order (by ascending age, but with some missing values):

data halfway;

set have;

by id;

if first.id then counter=1;

else counter + 1;

retain new_age;

if first.id then new_age = .;

if age > . then new_age = age;

else do;

new_age + 1;

age = new_age;

end;

drop new_age;

run;

That will take care of most of the missing values. The only ones left will be those at the beginning of the set of observations for an ID. To handle those, re-sort and repeat the process:

proc sort data=halfway;

by id descending counter;

run;

data want;

set halfway;

by id;

retain new_age;

if first.id then new_age = .;

if age > . then new_age = age;

else do;

new_age + (-1);

age = new_age;

end;

drop new_age counter;

run;

proc sort data=want;

by id age;

run;

novinosrin · Posted 09-27-2018 02:06 PM

data have;
input id age;
datalines;
1 34
1 35
1 36
1 .
1 .
1 .
1 40
1 41
1 .
1 .
2 .
2 22
2 23
2 24
2 .
2 26
2 .
2 .
2 .
2 30
3 .
3 .
3 .
3 .
3 .
3 .
3 50
3 .
3 .
3 .
;
run;

data want(drop=f _n _age age);
call missing(f);
do _n_=1 by 1 until(last.id);
set have;
by id;
if not missing(age) and not f then do;
_n=_n_;
_age=age;
f=1;
end;
end;
do until(last.id);
set have;
by id;
if first.id then __age=_age-_n+1;
else __age+1;;
output;
end;
rename __age=age;
run;

novinosrin · Posted 09-27-2018 02:12 PM

Hi @MFraga I have provided a solution. Kindly let me know if that works. Thank you!

MFraga · Posted 09-27-2018 02:40 PM

Thank you very much! it works!

yabwon · Posted 09-27-2018 02:32 PM

Hi,

10k obs it is nothing 😉 so I think that double reading and POINT= may be used. Try that code below.

Bart

data have;
input id age;
datalines;
1 34
1 35
1 36
1 .
1 .
1 .
1 40
1 41
1 .
1 .
2 .
2 22
2 23
2 24
2 .
2 26
2 .
2 .
2 .
2 30
3 .
3 .
3 .
3 .
3 .
3 .
3 50
3 .
3 .
3 .
run;


ods html;
proc print data = have;
run;

data want;

do point = nobs to 1 by -1;
set have point = point nobs = nobs;

lag_id=lag(id);
lag_age=coalesce(lag(age),lag_age); retain lag_age;

if lag_id and lag_id NE id then lag_age = .;

if age = . and id = lag_id and lag_age ne . then do; age = lag_age - 1; lag_age = age; end;
output;
end;
stop;
drop lag_id lag_age;
run;

ods html;
proc print data = want;
run;

data want2;

do point = nobs to 1 by -1;
set want point = point nobs = nobs;

lag_id=lag(id);
lag_age=coalesce(lag(age),lag_age); retain lag_age;

if lag_id and lag_id NE id then lag_age = .;

if age = . and id = lag_id and lag_age ne . then do; age = lag_age + 1; lag_age = age; end;
output;
end;
stop;
drop lag_id lag_age;
run;

ods html;
proc print data = want2;
run;

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ksharp · Posted 09-28-2018 09:04 AM

data have;
input id age;
datalines;
1 34
1 35
1 36
1 .
1 .
1 .
1 40
1 41
1 .
1 .
2 .
2 22
2 23
2 24
2 .
2 26
2 .
2 .
2 .
2 30
3 .
3 .
3 .
3 .
3 .
3 .
3 50
3 .
3 .
3 .
;
run;
data temp;
 set have;
 by id;
 if first.id then n=0;
 n+1;
run;

data want;
 merge temp temp(keep=id n age rename=(n=_n age=_age) where=(_age is not missing));
 by id;
 if missing(age) then age=_age+n-_n;
 drop n _:;
run;

Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values

Re: Lead / lag fonction to calculate Age for Missing Values