Solved: Re: Conditionally replace missing with means

lizzy28 · Posted 08-16-2016 03:26 PM

Hi everyone,

I am trying to standardize my data by replacing missing with means. However, only when the number of missing for a series of variables is less than half of the number of the series of variables (that is why I could not use proc standard). The data I have looks like below:

id	a1	a2	a3	a4	b1	b2	b3	b4	b5
1	1	5	4	1	1	3	2	3	4
2	4	1	3	5	1	4	3	3	1
3	2	3		3	3	3	3	2	3
4	3	3	2	5	2	3	1	1	1
5	3	3	2	4	4	4	1	3	2
6	1		2	1		3	2	3	4
7	3	2	1	4	2	2	4		4
8	3	5	4	3	1	1	3	4	4
9	1	1	2	5	4	2	4	1	3
10	1	3	5	5	1	1	2	4	2
11	3				1	4	2	4	4
12	4	5	1	5	3	2	2	3	4
13	1	3	4	2	2	2	1	3	1
14	3	5	1	2	3		1	4	4
15	5	2	5	3	1	1	1	1	2
16					2	2	3	4	1
17	4	4	5	1	2	1		1	4
18	4	1	2	2	1	1	4	1	3
19	5	5	1	2	2	1	1	3	3
20	2	4	4	5	3	3	3	2	3

I have to leave missing the values of a2, a3, a4 for id#11 and the values of a1-a4 for id #16 because nmiss(of a1-a4) is more than half of the number of variables. The rest of missing values, a3 for id#4,a2 for id#6, b1 for id#6, b4 for id#7, ... can be replaced by the means of the corresponding variable.

I start my code like this:

data want;set have;

array avar_all a1-a4;

array bvar_all b1-b5;

DO i=1 to dim(avar_all);

IF nmiss(of a1-a4)<4 THEN DO;

IF avar_all[i]=. THEN avar_all[i]=mean(???);

END;

DO i=1 to dim(bvar_all);

IF nmiss(of b1-b5)<4 THEN DO;

IF bvar_all[i]=. THEN bvar_all[i]=mean(???);

END;

run;

Anyone knows what I should put in (???)?

Thanks a lot!

ChrisNZ · Posted 08-16-2016 06:30 PM

Like this? [Edited as I wasn't computing a column average]

data HAVE;
input ID $3. (A1 A2 A3 A4 B1 B2 B3 B4 B5) (2.); 
cards;
1  1 5 4 1 1 3 2 3 4 
2  4 1 3 5 1 4 3 3 1 
3  2 3   3 3 3 3 2 3 
4  3 3 2 5 2 3 1 1 1 
5  3 3 2 4 4 4 1 3 2 
6  1   2 1   3 2 3 4 
7  3 2 1 4 2 2 4   4 
8  3 5 4 3 1 1 3 4 4 
9  1 1 2 5 4 2 4 1 3 
10 1 3 5 5 1 1 2 4 2 
11 3       1 4 2 4 4 
12 4 5 1 5 3 2 2 3 4 
13 1 3 4 2 2 2 1 3 1 
14 3 5 1 2 3   1 4 4 
15 5 2 5 3 1 1 1 1 2 
16         2 2 3 4 1 
17 4 4 5 1 2 1   1 4 
18 4 1 2 2 1 1 4 1 3 
19 5 5 1 2 2 1 1 3 3 
20 2 4 4 5 3 3 3 2 3 
run;
proc means data=HAVE(keep=A: B:) noprint; 
  output out=MEANS(drop=_:) mean=/autoname;
run;   
data WANT;
  set HAVE ;
  retain P 1;
  set MEANS point=P;
  array A [*] A1-A4 ;
  array B [*] B1-B5 ;
  keep ID A1-A4 B1-B5 ;
  if 0 < nmiss(of A[*]) < n(of A[*]) then do I=1 to dim(A);
    A[I]= coalesce( A[I], vvaluex(cats('A',I,'_MEAN' )));
  end;
  if 0 < nmiss(of B[*]) < n(of B[*]) then do I=1 to dim(B);
    B[I]= coalesce( B[I], vvaluex(cats('B',I,'_MEAN' )));
  end;
run;

Function whichn() cannot find a missing value, so we have to loop through the whole array to replace values sadly.

id	a1	a2	a3	a4	b1	b2	b3	b4	b5
1	1	5	4	1	1	3	2	3	4
2	4	1	3	5	1	4	3	3	1
3	2	3	2.82	3	3	3	3	2	3
4	3	3	2	5	2	3	1	1	1
5	3	3	2	4	4	4	1	3	2
6	1	3.24	2	1	2.05	3	2	3	4
7	3	2	1	4	2	2	4	2.63	4
8	3	5	4	3	1	1	3	4	4
9	1	1	2	5	4	2	4	1	3
10	1	3	5	5	1	1	2	4	2
11	3	.	.	.	1	4	2	4	4
12	4	5	1	5	3	2	2	3	4

High-Performance SAS Coding - Third Edition

View solution in original post

KachiM · Posted 08-16-2016 03:56 PM

IF avar_all[i]=. THEN avar_all[i]=mean(???);

Use mean(of avar_all[*]);

IF bvar_all[i]=. THEN bvar_all[i]=mean(???);

Use mean(of bvar_all[*]);

lizzy28 · Posted 08-16-2016 04:19 PM

Thank you!

mean(of avar_all[*]) gives the mean of across-variable, a1-a4. However, I need to replace the missing with the mean of other values of the same variable (just like proc standard does).

For example, a3 value for id#3 should be replaced by the mean of a3, 2.82.

KachiM · Posted 08-16-2016 05:18 PM

I don't see how you are getting 2.82

Give the individual values as how you are deriving it.

ChrisNZ · Posted 08-16-2016 06:30 PM

Like this? [Edited as I wasn't computing a column average]

data HAVE;
input ID $3. (A1 A2 A3 A4 B1 B2 B3 B4 B5) (2.); 
cards;
1  1 5 4 1 1 3 2 3 4 
2  4 1 3 5 1 4 3 3 1 
3  2 3   3 3 3 3 2 3 
4  3 3 2 5 2 3 1 1 1 
5  3 3 2 4 4 4 1 3 2 
6  1   2 1   3 2 3 4 
7  3 2 1 4 2 2 4   4 
8  3 5 4 3 1 1 3 4 4 
9  1 1 2 5 4 2 4 1 3 
10 1 3 5 5 1 1 2 4 2 
11 3       1 4 2 4 4 
12 4 5 1 5 3 2 2 3 4 
13 1 3 4 2 2 2 1 3 1 
14 3 5 1 2 3   1 4 4 
15 5 2 5 3 1 1 1 1 2 
16         2 2 3 4 1 
17 4 4 5 1 2 1   1 4 
18 4 1 2 2 1 1 4 1 3 
19 5 5 1 2 2 1 1 3 3 
20 2 4 4 5 3 3 3 2 3 
run;
proc means data=HAVE(keep=A: B:) noprint; 
  output out=MEANS(drop=_:) mean=/autoname;
run;   
data WANT;
  set HAVE ;
  retain P 1;
  set MEANS point=P;
  array A [*] A1-A4 ;
  array B [*] B1-B5 ;
  keep ID A1-A4 B1-B5 ;
  if 0 < nmiss(of A[*]) < n(of A[*]) then do I=1 to dim(A);
    A[I]= coalesce( A[I], vvaluex(cats('A',I,'_MEAN' )));
  end;
  if 0 < nmiss(of B[*]) < n(of B[*]) then do I=1 to dim(B);
    B[I]= coalesce( B[I], vvaluex(cats('B',I,'_MEAN' )));
  end;
run;

Function whichn() cannot find a missing value, so we have to loop through the whole array to replace values sadly.

id	a1	a2	a3	a4	b1	b2	b3	b4	b5
1	1	5	4	1	1	3	2	3	4
2	4	1	3	5	1	4	3	3	1
3	2	3	2.82	3	3	3	3	2	3
4	3	3	2	5	2	3	1	1	1
5	3	3	2	4	4	4	1	3	2
6	1	3.24	2	1	2.05	3	2	3	4
7	3	2	1	4	2	2	4	2.63	4
8	3	5	4	3	1	1	3	4	4
9	1	1	2	5	4	2	4	1	3
10	1	3	5	5	1	1	2	4	2
11	3	.	.	.	1	4	2	4	4
12	4	5	1	5	3	2	2	3	4

High-Performance SAS Coding - Third Edition

Reeza · Posted 08-16-2016 08:25 PM

@lizzy28 I'm curious whats the logic between looking at the number of missing row wise, but then replacing the value column wise.

lizzy28 · Posted 08-17-2016 09:05 AM

Thanks, Reeza.
It is a survey data I'm working on. For each measurement, there are multiple items. For example, a1-a4 is to measure individuals' mental health, and b1-b5 is to measure physical health. The sum of a1-a4 is the final score for a person. Therefore, if one person only completed one item out of four, the person cannot be counted as a valid observation.

Reeza · Posted 08-16-2016 05:16 PM

Do multiple steps.

First idenrify the variables to be standardized by getting the number of missing per column. Create a list of the variable names from that and pass it to proc standardize. Here's a sketch (untested).

Proc means data =have n nmiss stackods;

var a1-a4 b1-b5;

ODS output summary=stats;

run;

data var_list;

set stats end=EEof;

length var_list $250.;

retain var_list;

if nmiss/n > 0.5 the do;

var_list = catx(" ", var_list, name);

end;

if eeof then call symputx('var_list', var_list);

run;

proc standardize...

var &var_list;

...

run;

id	a1	a2	a3	a4	b1	b2	b3	b4	b5
1	1	5	4	1	1	3	2	3	4
2	4	1	3	5	1	4	3	3	1
3	2	3		3	3	3	3	2	3
4	3	3	2	5	2	3	1	1	1
5	3	3	2	4	4	4	1	3	2
6	1		2	1		3	2	3	4
7	3	2	1	4	2	2	4		4
8	3	5	4	3	1	1	3	4	4
9	1	1	2	5	4	2	4	1	3
10	1	3	5	5	1	1	2	4	2
11	3				1	4	2	4	4
12	4	5	1	5	3	2	2	3	4
13	1	3	4	2	2	2	1	3	1
14	3	5	1	2	3		1	4	4
15	5	2	5	3	1	1	1	1	2
16					2	2	3	4	1
17	4	4	5	1	2	1		1	4
18	4	1	2	2	1	1	4	1	3
19	5	5	1	2	2	1	1	3	3
20	2	4	4	5	3	3	3	2	3

id	a1	a2	a3	a4	b1	b2	b3	b4	b5
1	1	5	4	1	1	3	2	3	4
2	4	1	3	5	1	4	3	3	1
3	2	3		3	3	3	3	2	3
4	3	3	2	5	2	3	1	1	1
5	3	3	2	4	4	4	1	3	2
6	1		2	1		3	2	3	4
7	3	2	1	4	2	2	4		4
8	3	5	4	3	1	1	3	4	4
9	1	1	2	5	4	2	4	1	3
10	1	3	5	5	1	1	2	4	2
11	3				1	4	2	4	4
12	4	5	1	5	3	2	2	3	4
13	1	3	4	2	2	2	1	3	1
14	3	5	1	2	3		1	4	4
15	5	2	5	3	1	1	1	1	2
16					2	2	3	4	1
17	4	4	5	1	2	1		1	4
18	4	1	2	2	1	1	4	1	3
19	5	5	1	2	2	1	1	3	3
20	2	4	4	5	3	3	3	2	3

Classroom Training Available!

id	a1	a2	a3	a4	b1	b2	b3	b4	b5
1	1	5	4	1	1	3	2	3	4
2	4	1	3	5	1	4	3	3	1
3	2	3		3	3	3	3	2	3
4	3	3	2	5	2	3	1	1	1
5	3	3	2	4	4	4	1	3	2
6	1		2	1		3	2	3	4
7	3	2	1	4	2	2	4		4
8	3	5	4	3	1	1	3	4	4
9	1	1	2	5	4	2	4	1	3
10	1	3	5	5	1	1	2	4	2
11	3				1	4	2	4	4
12	4	5	1	5	3	2	2	3	4
13	1	3	4	2	2	2	1	3	1
14	3	5	1	2	3		1	4	4
15	5	2	5	3	1	1	1	1	2
16					2	2	3	4	1
17	4	4	5	1	2	1		1	4
18	4	1	2	2	1	1	4	1	3
19	5	5	1	2	2	1	1	3	3
20	2	4	4	5	3	3	3	2	3