Solved: Imputation age variable wide dataset

MFraga · Posted 10-04-2018 12:44 PM

Hello,

I need to impute missing values for my "age" variable. My data is organized as one observation per line. So I have multiple variables for the year of collection.

As you can see at the 6th, 7th and 8th observations, I have some inconsistencies in the declared age. So, when age is missing, I would like to consider the first age declared as the good to fill all the other missing values for that observation.

Here it is an example of the structure of my dataset:

data have;

input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;

datalines;

1 22 23 24 25 26 27 28 29 30 31

2 . . 35 . . 38 39 . . .

3 . 15 16 17 18 19 20 21 22 23

3 58 59 60 61 62 63 64 65 66 .

4 . . . . . 24 25 26 27 28

5 . . . . . . . 39 . .

6 . 38 72 47 68 . 80 22 . .

7 22 . . . . . . . . 79

8 . . . . . . . . 57 89

9 . . . . . . . . . 30;

run;

Here it is what I want as a result would be:

data want;

input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;

datalines;

1 22 23 24 25 26 27 28 29 30 31

2 33 34 35 36 37 38 39 40 41 42

3 14 15 16 17 18 19 20 21 22 23

3 58 59 60 61 62 63 64 65 66 67

4 19 20 21 22 23 24 25 26 27 28

5 32 33 34 35 36 37 38 39 40 41

6 37 38 39 40 41 42 43 44 45 46

7 22 23 24 25 26 27 28 29 30 31

8 49 50 51 52 53 54 55 56 57 58

9 21 22 23 24 25 26 27 28 29 30;

run;

I have tried this program using the DO function, but it does not work:

data want;

set have;

array age (10) age1990-age1999;

do i = 1 to 10;

if age(i) = . then age(i) = age(i+1) - 1;

end;

run;

Does anyone have a clue about how I could make this imputation ? Thank you!

novinosrin · Posted 10-04-2018 03:32 PM

Hi @MFraga

First off, my sincere apologies for the previous thread. I request you to kindly ignore. I overlooked the requirement and I beg your pardon. Turns out it is really easy.

I think I have nailed it this time. Test the below which works for your latest sample.

data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;

data want;
set have;
array rev[*] age1990-age1999;
array j age1990-age1999;
k=coalesce(of rev(*));
k1=whichn(k,of rev(*));
k2=rev(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;

View solution in original post

novinosrin · Posted 10-04-2018 01:20 PM

data have;

input id age1990 age1991 age1992 age1993 age1994 age1995 age1996 age1997 age1998 age1999;

datalines;
1 22 23 24 25 26 27 28 29 30 31
2 . . 35 . . 38 39 . . .
3 . 15 16 17 18 19 20 21 22 23
3 58 59 60 61 62 63 64 65 66 .
4 . . . . . 24 25 26 27 28
5 . . . . . . . 39 . .
6 . 38 72 47 68 . 80 22 . .
7 22 . . . . . . . . 79
8 . . . . . . . . 57 89
9 . . . . . . . . . 30
;

run;
data want;
set have;
array t age:;
array j age:;
k=coalesce(of t(*));
k1=whichn(k,of t(*));
k2=t(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;

MFraga · Posted 10-04-2018 01:54 PM

Hi, novinosrin,

Thanks for your help, but it does not work. SAS sends a error message:

ERROR: array subscript out of rang at line 521 colonne 6. I have more than 300000 observations in my dataset. Do you have any other clue why it does not work?

novinosrin · Posted 10-04-2018 01:55 PM

Is your sample a good representative of you real?

300000 is nothing in my opinion.

I am wondering if you correctly implemented the code

MFraga · Posted 10-04-2018 01:59 PM

Yes, it is. The only difference I face is the variable age arrangement. When I ckeck the table, it does not follow the exact order like that :

id age1990 age1991 age1992 ...

Actually, it is like that:

age1999 id age1998 age1997 age1996 age1995 age1994...

and my dataset has many other variables in between id and ages variables.

novinosrin · Posted 10-04-2018 02:01 PM

Well, do any of those vars have prefix age besides the variables we need?

That could kill the array

Plus if your variable age arrangement is not in order, the sequence logic will anyway defeat the purpose unless we sort the array with call sortn and modify the logic.

That;s the problem as it leads me to guess as I do not know your data

MFraga · Posted 10-04-2018 02:07 PM

Yes, do have other variables with the age prefix. But I created a new table only with the id and age1990-1999 variables and I tried to run your conding, but it does not work neither...

novinosrin · Posted 10-04-2018 02:08 PM

Please simulate the new table you created with some fake data and post it here. Let me try from my end

MFraga · Posted 10-04-2018 02:45 PM

So, I need have a table only with ID and AGEs variables. After this imputation process, I will merge the new table with the older one, dropping the old AGEs variables. My new table is like that:

data have;

input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;

datalines;

58 1 57 56 55 54 53 52 51 50 49

. 2 . . 39 38 . . 35 . .

. 3 23 22 21 20 19 18 17 16 15

. 4 66 65 64 63 62 61 60 59 58

28 5 27 26 25 24 . . . . .

. 6 . 39 . . . . . . .

. 7 . 22 80 . 68 47 72 38 .

79 8 . . . . . . . . 22

89 9 57 . . . . . . . .

30 10 . . . . . . . . .

;

run;

Make special atention to the 7th, 8th and 9th observations. I have some inconsistencies in the declaration of age. Therefore, I intend to use the first age declared in time (the first year that age is declared) to correct the inconsistencies in the imputation process of age variable. What I want would be:

data want;

input id age1999 age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;

datalines;

1 58 57 56 55 54 53 52 51 50 49

2 42 41 40 39 38 37 36 35 34 33

3 24 23 22 21 20 19 18 17 16 15

4 67 66 65 64 63 62 61 60 59 58

5 28 27 26 25 24 23 22 21 20 19

6 41 40 39 38 37 36 35 34 33 32

7 46 45 44 43 42 41 40 39 38 37

8 31 30 29 28 27 26 25 24 23 22

9 58 57 56 55 54 53 52 51 50 49

10 30 29 28 27 26 25 24 23 22 21

;

run;

Thank you again!

novinosrin · Posted 10-04-2018 02:59 PM

Thank you, Can you confirm this order plz

input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990

It looks just a reverse sequence, so biggie but I would like you to confirm to avoid going back and forth.

MFraga · Posted 10-04-2018 03:02 PM

Yes, I confirm. My variables are organized in a reverse sequence.

novinosrin · Posted 10-04-2018 03:16 PM

Hi @MFraga I trust our communication skills aren't bad 🙂

here you go:

data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;

data want;
set have;
array rev[*] age1990-age1999;
array t age:;
array j age:;
call sortn(of t(*));
k=coalesce(of t(*));
k1=whichn(k,of t(*));
k2=t(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
call sortn(of rev[*]);
drop k:;
run;

I think I still have to modify the logic, let me try and come back

novinosrin · Posted 10-04-2018 03:32 PM

Hi @MFraga

First off, my sincere apologies for the previous thread. I request you to kindly ignore. I overlooked the requirement and I beg your pardon. Turns out it is really easy.

I think I have nailed it this time. Test the below which works for your latest sample.

data have;
input age1999 id age1998 age1997 age1996 age1995 age1994 age1993 age1992 age1991 age1990;
datalines;
58 1 57 56 55 54 53 52 51 50 49
. 2 . . 39 38 . . 35 . .
. 3 23 22 21 20 19 18 17 16 15
. 4 66 65 64 63 62 61 60 59 58
28 5 27 26 25 24 . . . . .
. 6 . 39 . . . . . . .
. 7 . 22 80 . 68 47 72 38 .
79 8 . . . . . . . . 22
89 9 57 . . . . . . . .
30 10 . . . . . . . . .
;
run;

data want;
set have;
array rev[*] age1990-age1999;
array j age1990-age1999;
k=coalesce(of rev(*));
k1=whichn(k,of rev(*));
k2=rev(k1)- (k1-1);
do over j;
j=k2;
k2+1;
end;
drop k:;
run;

MFraga · Posted 10-04-2018 03:47 PM

It works! Thank you so much for your help, @novinosrin

novinosrin · Posted 10-04-2018 03:48 PM

Pleasure is all mine. Thank you for the nice question. Have fun and take care :)!

Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Re: Imputation age variable wide dataset

Registration is open