BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MFraga
Quartz | Level 8

Hello,

 

I need to impute missing values for my "age" variable. My data is organized as one observation per line. So I have multiple variables for the year of collection.

 

As you can see at the 6th, 7th and 8th observations, I have some inconsistencies in the declared age. So, when age is missing, I would like to consider the first age declared as the good to fill all the other missing values for that observation.

 

Here it is an example of the structure of my dataset:

 

data have;

input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;

datalines;

1 22 23 24 25 26 27 28 29 30 31

2 . . 35 . . 38 39 . . .

3 . 15 16 17 18 19 20 21 22 23

3 58 59 60 61 62 63 64 65 66 .

4 . . . . . 24 25 26 27 28

5 . . . . . . . 39 . .

6 . 38 72 47 68 . 80 22 . .

7 22 . . . . . . . . 79

8 . . . . . . . . 57 89

9 . . . . . . . . . 30;

run;

 

 

Here it is what I want as a result would be:

 

 

data want;

input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;

datalines;

1 22 23 24 25 26 27 28 29 30 31

2 33 34 35 36 37 38 39 40 41 42

3 14 15 16 17 18 19 20 21 22 23

3 58 59 60 61 62 63 64 65 66 67

4 19 20 21 22 23 24 25 26 27 28

5 32 33 34 35 36 37 38 39 40 41

6 37 38 39 40 41 42 43 44 45 46

7 22 23 24 25 26 27 28 29 30 31

8 49 50 51 52 53 54 55 56 57 58

9 21 22 23 24 25 26 27 28 29 30;

run;

 

 

 

I have tried this program using the DO function, but it does not work:

 

data want;

set have;

 

array age (10) age1990-age1999;

do i = 1 to 10;

 

if age(i) = . then age(i) = age(i+1) - 1;

end;

run;

 

 

Does anyone have a clue about how I could make this imputation ? Thank you!

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @MFraga  

First off, my sincere apologies for the previous thread. I request you to kindly ignore. I overlooked the requirement and I beg your pardon. Turns out it is really easy.

I think I have nailed it this time. Test the below which works for your latest sample.

 

data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;

data want;
set have;
array rev[*] age1990-age1999;
array j age1990-age1999;
k=coalesce(of rev(*));
k1=whichn(k,of rev(*));
k2=rev(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;

View solution in original post

14 REPLIES 14
novinosrin
Tourmaline | Level 20
data have;

input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;

datalines;
1 22 23 24 25 26 27 28 29 30 31
2 . . 35 . . 38 39 . . .
3 . 15 16 17 18 19 20 21 22 23
3 58 59 60 61 62 63 64 65 66 .
4 . . . . . 24 25 26 27 28
5 . . . . . . . 39 . .
6 . 38 72 47 68 . 80 22 . .
7 22 . . . . . . . . 79
8 . . . . . . . . 57 89
9 . . . . . . . . . 30
;

run;
data want;
set have;
array t age:;
array j age:;
k=coalesce(of t(*));
k1=whichn(k,of t(*));
k2=t(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;

MFraga
Quartz | Level 8

Hi, novinosrin,

 

Thanks for your help, but it does not work. SAS sends a error message:

 

ERROR: array subscript out of rang at line 521 colonne 6. I have more than 300000 observations in my dataset. Do you have any other clue why it does not work?

 

 

novinosrin
Tourmaline | Level 20

Is your sample a good representative of you real?

 

300000  is nothing in my opinion. 

 

I am wondering if you correctly implemented the code

MFraga
Quartz | Level 8

Yes, it is. The only difference I face is the variable age arrangement. When I ckeck the table, it does not follow the exact order like that :

 

id age1990 age1991 age1992 ...

 

Actually, it is like that:

 

age1999 id age1998 age1997 age1996 age1995 age1994...

 

and my dataset has many other variables in between id and ages variables.

novinosrin
Tourmaline | Level 20

Well, do any of those vars have prefix age besides the variables we need?

 

That could kill the array

 

Plus if your variable age arrangement is not in order, the sequence logic will anyway defeat the purpose unless we sort the array with call sortn and modify the logic.

 

That;s the problem as it leads me to guess as I do not know your data

MFraga
Quartz | Level 8

Yes,  do have other variables with the age prefix. But I created a new table only with the id and age1990-1999 variables and I tried to run your conding, but it does not work neither...

novinosrin
Tourmaline | Level 20

Please simulate the new table you created  with some fake data and post it here. Let me try from my end

MFraga
Quartz | Level 8

So, I need have a table only with ID and AGEs variables. After this imputation process, I will merge the new table with the older one, dropping the old AGEs variables. My new table is like that:

 

data have;

input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;

datalines;

 

58 1 57 56 55 54 53 52 51 50 49

. 2 . . 39 38 . . 35 . .

. 3 23 22 21 20 19 18 17 16 15

. 4 66 65 64 63 62 61 60 59 58

 

28 5 27 26 25 24 . . . . .

. 6 . 39 . . . . . . .

. 7 . 22 80 . 68 47 72 38 .

79 8 . . . . . . . . 22

89 9 57 . . . . . . . .

30 10 . . . . . . . . .

;

 

run;

 

 

Make special atention to the 7th, 8th and 9th observations. I have some inconsistencies in the declaration of age. Therefore, I intend to use the first age declared in time (the first year that age is declared) to correct the inconsistencies in the imputation process of age variable. What I want would be:

 

data want;

input id age1999 age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;

datalines;

 

1 58 57 56 55 54 53 52 51 50 49

2 42 41 40 39 38 37 36 35 34 33

3 24 23 22 21 20 19 18 17 16 15

4 67 66 65 64 63 62 61 60 59 58

5 28 27 26 25 24 23 22 21 20 19

6 41 40 39 38 37 36 35 34 33 32

7 46 45 44 43 42 41 40 39 38 37

8 31 30 29 28 27 26 25 24 23 22

9 58 57 56 55 54 53 52 51 50 49

10 30 29 28 27 26 25 24 23 22 21

;

run;

 

 

Thank you again!

 

 

novinosrin
Tourmaline | Level 20

Thank you, Can you confirm this order plz

 

input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990

 

It looks just a reverse sequence, so biggie but I would like you to confirm to avoid going back and forth.

 

 

 

MFraga
Quartz | Level 8

Yes, I confirm. My variables are organized in a reverse sequence.

novinosrin
Tourmaline | Level 20

Hi @MFraga I trust our communication skills aren't bad 🙂

 

here you go:

 

data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;

data want;
set have;
array rev[*] age1990-age1999;
array t age:;
array j age:;
call sortn(of t(*));
k=coalesce(of t(*));
k1=whichn(k,of t(*));
k2=t(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
call sortn(of rev[*]);
drop k:;
run;
 
 

 

I think I still have to modify the logic, let me try and come back 

novinosrin
Tourmaline | Level 20

Hi @MFraga  

First off, my sincere apologies for the previous thread. I request you to kindly ignore. I overlooked the requirement and I beg your pardon. Turns out it is really easy.

I think I have nailed it this time. Test the below which works for your latest sample.

 

data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;

data want;
set have;
array rev[*] age1990-age1999;
array j age1990-age1999;
k=coalesce(of rev(*));
k1=whichn(k,of rev(*));
k2=rev(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;
MFraga
Quartz | Level 8

It works! Thank you so much for your help, @novinosrin

novinosrin
Tourmaline | Level 20

Pleasure is all mine. Thank you for the nice question. Have fun and take care :)!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 3401 views
  • 0 likes
  • 2 in conversation