BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MFraga
Quartz | Level 8

Hello,

 

I need to impute missing values for my "age" variable. I have a huge longitudinal dataset (more than ten thousand observations) and every line corresponds to an annually observation. So I need to impute "age" controlling by my "id" variable. Here is an example of the structure of my dataset:

 

 

data have;

input id age;

datalines;

1 34

1 35

1 36

1 .

1 .

1 .

1 40

1 41

1 .

1 .

2 .

2 22

2 23

2 24

2 .

2 26

2 .

2 .

2 .

2 30

3 .

3 .

3 .

3 .

3 .

3 .

3 50

3 .

3 .

3 .

run;

 

 

I have already sorted the data in a descending way using this program:

 

proc sort data=have;

by id descending age;

run;

 

 

This works fine. My next step is to impute the missing values for my "age" variable. I write this data setp:

 

data want;

set have;

if age = . and id = lag(id) then age = lag(age) - 1;

run;

 

Unfortunately, the missing values for "age" remain missing. How could I solve this problem ? Also, for my "id" 3, how could I impute the observations that are above and are also missing ?

 

What I want as a result for the "age" variable would be:

 

 

datalines;

1 34

1 35

1 36

1 37

1 38

1 39

1 40

1 41

1 42

1 43

2 21

2 22

2 23

2 24

2 25

2 26

2 27

2 28

2 29

2 30

3 44

3 45

3 46

3 47

3 48

3 49

3 50

3 51

3 52

3 53

run;

 

 

I use SAS 9.3.

 

Thank you for your help!

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @MFraga I have provided a solution. Kindly let me know if that works. Thank you!

View solution in original post

9 REPLIES 9
ballardw
Super User

@MFraga wrote:

Hello,

 

I need to impute missing values for my "age" variable. I have a huge longitudinal dataset (more than ten thousand observations) and every line corresponds to an annually observation. So I need to impute "age" controlling by my "id" variable. Here is an example of the structure of my dataset: 

 


First thing, 10,000 observations is not "huge" for SAS. Worry about huge when getting in the 100s of millions or more.

 

If this is longitudinal data do you have a date for when each record was collected? That would be appropriate to sort by as well to get records in order and might providing additional data that can be used provide a better estimate.

novinosrin
Tourmaline | Level 20

for id 2,

2 .

2 22

2 23

2 24

2 .

2 26

2 .

 

Since you do not have the first age(i.e being missing) , do you want to the logic to go backwards and forwards?

Oh well, perhaps that's what you meant by lead/lag?

MFraga
Quartz | Level 8

Hi novinosrin,

 

I need to fill all the missing values following the logic idea that we just get old one year every year. Every observation is colected just once every year. I do not want to have missing values for my "age" variable. The way I will sort my data at the end of the imputation will be "ascending" the "age" variable after I have no more missing values. So, for the "id" 2, I need to fill the all the missing values, it does not matter if it is the first or the last observation.

 

Originally, my data is "ascending" for age so I know that every missing value corresponds to, for instance:

 

"22 - 1 = 21" to fill the first line for the "id" 2,

 

or to fill "26 + 1 = 27" for the last line of "id" 2,

 

or to fill "26 - 1 = 25" or "24 + 1 = 25" for the fifth line of "id" 2.

 

Thanks for your help!

Astounding
PROC Star

You may need to run through the data twice, because of the possibility of missing values at either the beginning or end of the series.

 

You also will need to set up an arbitrary variable as a counter that you can use for sorting.  Here's an example that assumes your data set is originally in the proper order (by ascending age, but with some missing values):

 

data halfway;

set have;

by id;

if first.id then counter=1;

else counter + 1;

retain new_age;

if first.id then new_age = .;

if age > . then new_age = age;

else do;

   new_age + 1;

   age = new_age;

end;

drop new_age;

run;

 

That will take care of most of the missing values.  The only ones left will be those at the beginning of the set of observations for an ID.  To handle those, re-sort and repeat the process:

 

proc sort data=halfway;

by id descending counter;

run;

 

data want;

set halfway;

by id;

retain new_age;

if first.id then new_age = .;

if age > . then new_age = age;

else do;

   new_age + (-1);

   age = new_age;

end;

drop new_age counter;

run;

 

proc sort data=want;

by id age;

run;

novinosrin
Tourmaline | Level 20
data have;
input id age;
datalines;
1 34
1 35
1 36
1 .
1 .
1 .
1 40
1 41
1 .
1 .
2 .
2 22
2 23
2 24
2 .
2 26
2 .
2 .
2 .
2 30
3 .
3 .
3 .
3 .
3 .
3 .
3 50
3 .
3 .
3 .
;
run;

data want(drop=f _n _age age);
call missing(f);
do _n_=1 by 1 until(last.id);
set have;
by id;
if not missing(age) and not f then do;
_n=_n_;
_age=age;
f=1;
end;
end;
do until(last.id);
set have;
by id;
if first.id then __age=_age-_n+1;
else __age+1;;
output;
end;
rename __age=age;
run;
novinosrin
Tourmaline | Level 20

Hi @MFraga I have provided a solution. Kindly let me know if that works. Thank you!

MFraga
Quartz | Level 8

Thank you very much! it works!

yabwon
Onyx | Level 15

Hi,

 

10k obs it is nothing 😉 so I think that double reading and POINT= may be used. Try that code below.

 

Bart

 

data have;
input id age;
datalines;
1 34
1 35
1 36
1 .
1 .
1 .
1 40
1 41
1 .
1 .
2 .
2 22
2 23
2 24
2 .
2 26
2 .
2 .
2 .
2 30
3 .
3 .
3 .
3 .
3 .
3 .
3 50
3 .
3 .
3 .
run;


ods html;
proc print data = have;
run;

data want;

do point = nobs to 1 by -1;
set have point = point nobs = nobs;

lag_id=lag(id);
lag_age=coalesce(lag(age),lag_age); retain lag_age;

if lag_id and lag_id NE id then lag_age = .;

if age = . and id = lag_id and lag_age ne . then do; age = lag_age - 1; lag_age = age; end;
output;
end;
stop;
drop lag_id lag_age;
run;

ods html;
proc print data = want;
run;

data want2;

do point = nobs to 1 by -1;
set want point = point nobs = nobs;

lag_id=lag(id);
lag_age=coalesce(lag(age),lag_age); retain lag_age;

if lag_id and lag_id NE id then lag_age = .;

if age = . and id = lag_id and lag_age ne . then do; age = lag_age + 1; lag_age = age; end;
output;
end;
stop;
drop lag_id lag_age;
run;

ods html;
proc print data = want2;
run;

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Ksharp
Super User
data have;
input id age;
datalines;
1 34
1 35
1 36
1 .
1 .
1 .
1 40
1 41
1 .
1 .
2 .
2 22
2 23
2 24
2 .
2 26
2 .
2 .
2 .
2 30
3 .
3 .
3 .
3 .
3 .
3 .
3 50
3 .
3 .
3 .
;
run;
data temp;
 set have;
 by id;
 if first.id then n=0;
 n+1;
run;

data want;
 merge temp temp(keep=id n age rename=(n=_n age=_age) where=(_age is not missing));
 by id;
 if missing(age) then age=_age+n-_n;
 drop n _:;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1827 views
  • 4 likes
  • 6 in conversation