Hi everyone,
I am trying to standardize my data by replacing missing with means. However, only when the number of missing for a series of variables is less than half of the number of the series of variables (that is why I could not use proc standard). The data I have looks like below:
id | a1 | a2 | a3 | a4 | b1 | b2 | b3 | b4 | b5 |
1 | 1 | 5 | 4 | 1 | 1 | 3 | 2 | 3 | 4 |
2 | 4 | 1 | 3 | 5 | 1 | 4 | 3 | 3 | 1 |
3 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | |
4 | 3 | 3 | 2 | 5 | 2 | 3 | 1 | 1 | 1 |
5 | 3 | 3 | 2 | 4 | 4 | 4 | 1 | 3 | 2 |
6 | 1 | 2 | 1 | 3 | 2 | 3 | 4 | ||
7 | 3 | 2 | 1 | 4 | 2 | 2 | 4 | 4 | |
8 | 3 | 5 | 4 | 3 | 1 | 1 | 3 | 4 | 4 |
9 | 1 | 1 | 2 | 5 | 4 | 2 | 4 | 1 | 3 |
10 | 1 | 3 | 5 | 5 | 1 | 1 | 2 | 4 | 2 |
11 | 3 | 1 | 4 | 2 | 4 | 4 | |||
12 | 4 | 5 | 1 | 5 | 3 | 2 | 2 | 3 | 4 |
13 | 1 | 3 | 4 | 2 | 2 | 2 | 1 | 3 | 1 |
14 | 3 | 5 | 1 | 2 | 3 | 1 | 4 | 4 | |
15 | 5 | 2 | 5 | 3 | 1 | 1 | 1 | 1 | 2 |
16 | 2 | 2 | 3 | 4 | 1 | ||||
17 | 4 | 4 | 5 | 1 | 2 | 1 | 1 | 4 | |
18 | 4 | 1 | 2 | 2 | 1 | 1 | 4 | 1 | 3 |
19 | 5 | 5 | 1 | 2 | 2 | 1 | 1 | 3 | 3 |
20 | 2 | 4 | 4 | 5 | 3 | 3 | 3 | 2 | 3 |
I have to leave missing the values of a2, a3, a4 for id#11 and the values of a1-a4 for id #16 because nmiss(of a1-a4) is more than half of the number of variables. The rest of missing values, a3 for id#4,a2 for id#6, b1 for id#6, b4 for id#7, ... can be replaced by the means of the corresponding variable.
I start my code like this:
data want;set have;
array avar_all a1-a4;
array bvar_all b1-b5;
DO i=1 to dim(avar_all);
IF nmiss(of a1-a4)<4 THEN DO;
IF avar_all[i]=. THEN avar_all[i]=mean(???);
END;
END;
DO i=1 to dim(bvar_all);
IF nmiss(of b1-b5)<4 THEN DO;
IF bvar_all[i]=. THEN bvar_all[i]=mean(???);
END;
END;
run;
Anyone knows what I should put in (???)?
Thanks a lot!
Like this? [Edited as I wasn't computing a column average]
data HAVE;
input ID $3. (A1 A2 A3 A4 B1 B2 B3 B4 B5) (2.);
cards;
1 1 5 4 1 1 3 2 3 4
2 4 1 3 5 1 4 3 3 1
3 2 3 3 3 3 3 2 3
4 3 3 2 5 2 3 1 1 1
5 3 3 2 4 4 4 1 3 2
6 1 2 1 3 2 3 4
7 3 2 1 4 2 2 4 4
8 3 5 4 3 1 1 3 4 4
9 1 1 2 5 4 2 4 1 3
10 1 3 5 5 1 1 2 4 2
11 3 1 4 2 4 4
12 4 5 1 5 3 2 2 3 4
13 1 3 4 2 2 2 1 3 1
14 3 5 1 2 3 1 4 4
15 5 2 5 3 1 1 1 1 2
16 2 2 3 4 1
17 4 4 5 1 2 1 1 4
18 4 1 2 2 1 1 4 1 3
19 5 5 1 2 2 1 1 3 3
20 2 4 4 5 3 3 3 2 3
run;
proc means data=HAVE(keep=A: B:) noprint;
output out=MEANS(drop=_:) mean=/autoname;
run;
data WANT;
set HAVE ;
retain P 1;
set MEANS point=P;
array A [*] A1-A4 ;
array B [*] B1-B5 ;
keep ID A1-A4 B1-B5 ;
if 0 < nmiss(of A[*]) < n(of A[*]) then do I=1 to dim(A);
A[I]= coalesce( A[I], vvaluex(cats('A',I,'_MEAN' )));
end;
if 0 < nmiss(of B[*]) < n(of B[*]) then do I=1 to dim(B);
B[I]= coalesce( B[I], vvaluex(cats('B',I,'_MEAN' )));
end;
run;
Function whichn() cannot find a missing value, so we have to loop through the whole array to replace values sadly.
id | a1 | a2 | a3 | a4 | b1 | b2 | b3 | b4 | b5 |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 5 | 4 | 1 | 1 | 3 | 2 | 3 | 4 |
2 | 4 | 1 | 3 | 5 | 1 | 4 | 3 | 3 | 1 |
3 | 2 | 3 | 2.82 | 3 | 3 | 3 | 3 | 2 | 3 |
4 | 3 | 3 | 2 | 5 | 2 | 3 | 1 | 1 | 1 |
5 | 3 | 3 | 2 | 4 | 4 | 4 | 1 | 3 | 2 |
6 | 1 | 3.24 | 2 | 1 | 2.05 | 3 | 2 | 3 | 4 |
7 | 3 | 2 | 1 | 4 | 2 | 2 | 4 | 2.63 | 4 |
8 | 3 | 5 | 4 | 3 | 1 | 1 | 3 | 4 | 4 |
9 | 1 | 1 | 2 | 5 | 4 | 2 | 4 | 1 | 3 |
10 | 1 | 3 | 5 | 5 | 1 | 1 | 2 | 4 | 2 |
11 | 3 | . | . | . | 1 | 4 | 2 | 4 | 4 |
12 | 4 | 5 | 1 | 5 | 3 | 2 | 2 | 3 | 4 |
IF avar_all[i]=. THEN avar_all[i]=mean(???);
Use mean(of avar_all[*]);
IF bvar_all[i]=. THEN bvar_all[i]=mean(???);
Use mean(of bvar_all[*]);
Thank you!
mean(of avar_all[*]) gives the mean of across-variable, a1-a4. However, I need to replace the missing with the mean of other values of the same variable (just like proc standard does).
For example, a3 value for id#3 should be replaced by the mean of a3, 2.82.
I don't see how you are getting 2.82
Give the individual values as how you are deriving it.
Like this? [Edited as I wasn't computing a column average]
data HAVE;
input ID $3. (A1 A2 A3 A4 B1 B2 B3 B4 B5) (2.);
cards;
1 1 5 4 1 1 3 2 3 4
2 4 1 3 5 1 4 3 3 1
3 2 3 3 3 3 3 2 3
4 3 3 2 5 2 3 1 1 1
5 3 3 2 4 4 4 1 3 2
6 1 2 1 3 2 3 4
7 3 2 1 4 2 2 4 4
8 3 5 4 3 1 1 3 4 4
9 1 1 2 5 4 2 4 1 3
10 1 3 5 5 1 1 2 4 2
11 3 1 4 2 4 4
12 4 5 1 5 3 2 2 3 4
13 1 3 4 2 2 2 1 3 1
14 3 5 1 2 3 1 4 4
15 5 2 5 3 1 1 1 1 2
16 2 2 3 4 1
17 4 4 5 1 2 1 1 4
18 4 1 2 2 1 1 4 1 3
19 5 5 1 2 2 1 1 3 3
20 2 4 4 5 3 3 3 2 3
run;
proc means data=HAVE(keep=A: B:) noprint;
output out=MEANS(drop=_:) mean=/autoname;
run;
data WANT;
set HAVE ;
retain P 1;
set MEANS point=P;
array A [*] A1-A4 ;
array B [*] B1-B5 ;
keep ID A1-A4 B1-B5 ;
if 0 < nmiss(of A[*]) < n(of A[*]) then do I=1 to dim(A);
A[I]= coalesce( A[I], vvaluex(cats('A',I,'_MEAN' )));
end;
if 0 < nmiss(of B[*]) < n(of B[*]) then do I=1 to dim(B);
B[I]= coalesce( B[I], vvaluex(cats('B',I,'_MEAN' )));
end;
run;
Function whichn() cannot find a missing value, so we have to loop through the whole array to replace values sadly.
id | a1 | a2 | a3 | a4 | b1 | b2 | b3 | b4 | b5 |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 5 | 4 | 1 | 1 | 3 | 2 | 3 | 4 |
2 | 4 | 1 | 3 | 5 | 1 | 4 | 3 | 3 | 1 |
3 | 2 | 3 | 2.82 | 3 | 3 | 3 | 3 | 2 | 3 |
4 | 3 | 3 | 2 | 5 | 2 | 3 | 1 | 1 | 1 |
5 | 3 | 3 | 2 | 4 | 4 | 4 | 1 | 3 | 2 |
6 | 1 | 3.24 | 2 | 1 | 2.05 | 3 | 2 | 3 | 4 |
7 | 3 | 2 | 1 | 4 | 2 | 2 | 4 | 2.63 | 4 |
8 | 3 | 5 | 4 | 3 | 1 | 1 | 3 | 4 | 4 |
9 | 1 | 1 | 2 | 5 | 4 | 2 | 4 | 1 | 3 |
10 | 1 | 3 | 5 | 5 | 1 | 1 | 2 | 4 | 2 |
11 | 3 | . | . | . | 1 | 4 | 2 | 4 | 4 |
12 | 4 | 5 | 1 | 5 | 3 | 2 | 2 | 3 | 4 |
@lizzy28 I'm curious whats the logic between looking at the number of missing row wise, but then replacing the value column wise.
Do multiple steps.
First idenrify the variables to be standardized by getting the number of missing per column. Create a list of the variable names from that and pass it to proc standardize. Here's a sketch (untested).
Proc means data =have n nmiss stackods;
var a1-a4 b1-b5;
ODS output summary=stats;
run;
data var_list;
set stats end=EEof;
length var_list $250.;
retain var_list;
if nmiss/n > 0.5 the do;
var_list = catx(" ", var_list, name);
end;
if eeof then call symputx('var_list', var_list);
run;
proc standardize...
var &var_list;
...
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.