Hello,
I need to impute missing values for my "age" variable. My data is organized as one observation per line. So I have multiple variables for the year of collection.
As you can see at the 6th, 7th and 8th observations, I have some inconsistencies in the declared age. So, when age is missing, I would like to consider the first age declared as the good to fill all the other missing values for that observation.
Here it is an example of the structure of my dataset:
data have;
input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;
datalines;
1 22 23 24 25 26 27 28 29 30 31
2 . . 35 . . 38 39 . . .
3 . 15 16 17 18 19 20 21 22 23
3 58 59 60 61 62 63 64 65 66 .
4 . . . . . 24 25 26 27 28
5 . . . . . . . 39 . .
6 . 38 72 47 68 . 80 22 . .
7 22 . . . . . . . . 79
8 . . . . . . . . 57 89
9 . . . . . . . . . 30;
run;
Here it is what I want as a result would be:
data want;
input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;
datalines;
1 22 23 24 25 26 27 28 29 30 31
2 33 34 35 36 37 38 39 40 41 42
3 14 15 16 17 18 19 20 21 22 23
3 58 59 60 61 62 63 64 65 66 67
4 19 20 21 22 23 24 25 26 27 28
5 32 33 34 35 36 37 38 39 40 41
6 37 38 39 40 41 42 43 44 45 46
7 22 23 24 25 26 27 28 29 30 31
8 49 50 51 52 53 54 55 56 57 58
9 21 22 23 24 25 26 27 28 29 30;
run;
I have tried this program using the DO function, but it does not work:
data want;
set have;
array age (10) age1990-age1999;
do i = 1 to 10;
if age(i) = . then age(i) = age(i+1) - 1;
end;
run;
Does anyone have a clue about how I could make this imputation ? Thank you!
Hi @MFraga
First off, my sincere apologies for the previous thread. I request you to kindly ignore. I overlooked the requirement and I beg your pardon. Turns out it is really easy.
I think I have nailed it this time. Test the below which works for your latest sample.
data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;
data want;
set have;
array rev[*] age1990-age1999;
array j age1990-age1999;
k=coalesce(of rev(*));
k1=whichn(k,of rev(*));
k2=rev(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;
data have;
input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;
datalines;
1 22 23 24 25 26 27 28 29 30 31
2 . . 35 . . 38 39 . . .
3 . 15 16 17 18 19 20 21 22 23
3 58 59 60 61 62 63 64 65 66 .
4 . . . . . 24 25 26 27 28
5 . . . . . . . 39 . .
6 . 38 72 47 68 . 80 22 . .
7 22 . . . . . . . . 79
8 . . . . . . . . 57 89
9 . . . . . . . . . 30
;
run;
data want;
set have;
array t age:;
array j age:;
k=coalesce(of t(*));
k1=whichn(k,of t(*));
k2=t(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;
Hi, novinosrin,
Thanks for your help, but it does not work. SAS sends a error message:
ERROR: array subscript out of rang at line 521 colonne 6. I have more than 300000 observations in my dataset. Do you have any other clue why it does not work?
Is your sample a good representative of you real?
300000 is nothing in my opinion.
I am wondering if you correctly implemented the code
Yes, it is. The only difference I face is the variable age arrangement. When I ckeck the table, it does not follow the exact order like that :
id age1990 age1991 age1992 ...
Actually, it is like that:
age1999 id age1998 age1997 age1996 age1995 age1994...
and my dataset has many other variables in between id and ages variables.
Well, do any of those vars have prefix age besides the variables we need?
That could kill the array
Plus if your variable age arrangement is not in order, the sequence logic will anyway defeat the purpose unless we sort the array with call sortn and modify the logic.
That;s the problem as it leads me to guess as I do not know your data
Yes, do have other variables with the age prefix. But I created a new table only with the id and age1990-1999 variables and I tried to run your conding, but it does not work neither...
Please simulate the new table you created with some fake data and post it here. Let me try from my end
So, I need have a table only with ID and AGEs variables. After this imputation process, I will merge the new table with the older one, dropping the old AGEs variables. My new table is like that:
data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;
Make special atention to the 7th, 8th and 9th observations. I have some inconsistencies in the declaration of age. Therefore, I intend to use the first age declared in time (the first year that age is declared) to correct the inconsistencies in the imputation process of age variable. What I want would be:
data want;
input id age1999 age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
1 58 57 56 55 54 53 52 51 50 49
2 42 41 40 39 38 37 36 35 34 33
3 24 23 22 21 20 19 18 17 16 15
4 67 66 65 64 63 62 61 60 59 58
5 28 27 26 25 24 23 22 21 20 19
6 41 40 39 38 37 36 35 34 33 32
7 46 45 44 43 42 41 40 39 38 37
8 31 30 29 28 27 26 25 24 23 22
9 58 57 56 55 54 53 52 51 50 49
10 30 29 28 27 26 25 24 23 22 21
;
run;
Thank you again!
Thank you, Can you confirm this order plz
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990
It looks just a reverse sequence, so biggie but I would like you to confirm to avoid going back and forth.
Yes, I confirm. My variables are organized in a reverse sequence.
Hi @MFraga I trust our communication skills aren't bad 🙂
here you go:
data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;
data want;
set have;
array rev[*] age1990-age1999;
array t age:;
array j age:;
call sortn(of t(*));
k=coalesce(of t(*));
k1=whichn(k,of t(*));
k2=t(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
call sortn(of rev[*]);
drop k:;
run;
I think I still have to modify the logic, let me try and come back
Hi @MFraga
First off, my sincere apologies for the previous thread. I request you to kindly ignore. I overlooked the requirement and I beg your pardon. Turns out it is really easy.
I think I have nailed it this time. Test the below which works for your latest sample.
data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;
data want;
set have;
array rev[*] age1990-age1999;
array j age1990-age1999;
k=coalesce(of rev(*));
k1=whichn(k,of rev(*));
k2=rev(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;
It works! Thank you so much for your help, @novinosrin
Pleasure is all mine. Thank you for the nice question. Have fun and take care :)!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.