BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lizzy28
Quartz | Level 8

Hi everyone,

 

I am trying to standardize my data by replacing missing with means. However, only when the number of missing for a series of variables is less than half of the number of the series of variables (that is why I could not use proc standard). The data I have looks like below:

id a1 a2 a3 a4 b1 b2 b3 b4 b5
1 1 5 4 1 1 3 2 3 4
2 4 1 3 5 1 4 3 3 1
3 2 3   3 3 3 3 2 3
4 3 3 2 5 2 3 1 1 1
5 3 3 2 4 4 4 1 3 2
6 1   2 1   3 2 3 4
7 3 2 1 4 2 2 4   4
8 3 5 4 3 1 1 3 4 4
9 1 1 2 5 4 2 4 1 3
10 1 3 5 5 1 1 2 4 2
11 3       1 4 2 4 4
12 4 5 1 5 3 2 2 3 4
13 1 3 4 2 2 2 1 3 1
14 3 5 1 2 3   1 4 4
15 5 2 5 3 1 1 1 1 2
16         2 2 3 4 1
17 4 4 5 1 2 1   1 4
18 4 1 2 2 1 1 4 1 3
19 5 5 1 2 2 1 1 3 3
20 2 4 4 5 3 3 3 2 3

 

I have to leave missing the values of a2, a3, a4 for id#11 and the values of a1-a4 for id #16 because nmiss(of a1-a4) is more than half of the number of variables. The rest of missing values, a3 for id#4,a2 for id#6, b1 for id#6, b4 for id#7, ... can be replaced by the means of the corresponding variable.

 

I start my code like this:

data want;set have;

array avar_all a1-a4;

array bvar_all b1-b5;

DO i=1 to dim(avar_all);

IF nmiss(of a1-a4)<4 THEN DO;

IF avar_all[i]=. THEN avar_all[i]=mean(???);

END;

END;

DO i=1 to dim(bvar_all);

IF nmiss(of b1-b5)<4 THEN DO;

IF bvar_all[i]=. THEN bvar_all[i]=mean(???);

END;

END;

run;

 

Anyone knows what I should put in (???)?

 

Thanks a lot!

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Like this? [Edited as I wasn't computing a column average]

 

 

data HAVE;
input ID $3. (A1 A2 A3 A4 B1 B2 B3 B4 B5) (2.); 
cards;
1  1 5 4 1 1 3 2 3 4 
2  4 1 3 5 1 4 3 3 1 
3  2 3   3 3 3 3 2 3 
4  3 3 2 5 2 3 1 1 1 
5  3 3 2 4 4 4 1 3 2 
6  1   2 1   3 2 3 4 
7  3 2 1 4 2 2 4   4 
8  3 5 4 3 1 1 3 4 4 
9  1 1 2 5 4 2 4 1 3 
10 1 3 5 5 1 1 2 4 2 
11 3       1 4 2 4 4 
12 4 5 1 5 3 2 2 3 4 
13 1 3 4 2 2 2 1 3 1 
14 3 5 1 2 3   1 4 4 
15 5 2 5 3 1 1 1 1 2 
16         2 2 3 4 1 
17 4 4 5 1 2 1   1 4 
18 4 1 2 2 1 1 4 1 3 
19 5 5 1 2 2 1 1 3 3 
20 2 4 4 5 3 3 3 2 3 
run;
proc means data=HAVE(keep=A: B:) noprint; 
  output out=MEANS(drop=_:) mean=/autoname;
run;   
data WANT;
  set HAVE ;
  retain P 1;
  set MEANS point=P;
  array A [*] A1-A4 ;
  array B [*] B1-B5 ;
  keep ID A1-A4 B1-B5 ;
  if 0 < nmiss(of A[*]) < n(of A[*]) then do I=1 to dim(A);
    A[I]= coalesce( A[I], vvaluex(cats('A',I,'_MEAN' )));
  end;
  if 0 < nmiss(of B[*]) < n(of B[*]) then do I=1 to dim(B);
    B[I]= coalesce( B[I], vvaluex(cats('B',I,'_MEAN' )));
  end;
run;

 

Function whichn() cannot find a missing value, so we have to loop through the whole array to replace values sadly.

 

 

 

id a1 a2 a3 a4 b1 b2 b3 b4 b5
1 1 5 4 1 1 3 2 3 4
2 4 1 3 5 1 4 3 3 1
3 2 3 2.82 3 3 3 3 2 3
4 3 3 2 5 2 3 1 1 1
5 3 3 2 4 4 4 1 3 2
6 1 3.24 2 1 2.05 3 2 3 4
7 3 2 1 4 2 2 4 2.63 4
8 3 5 4 3 1 1 3 4 4
9 1 1 2 5 4 2 4 1 3
10 1 3 5 5 1 1 2 4 2
11 3 . . . 1 4 2 4 4
12 4 5 1 5 3 2 2 3 4

View solution in original post

7 REPLIES 7
KachiM
Rhodochrosite | Level 12

 

IF avar_all[i]=. THEN avar_all[i]=mean(???);

 

Use  mean(of avar_all[*]);

 

 

IF bvar_all[i]=. THEN bvar_all[i]=mean(???);

Use  mean(of bvar_all[*]);

 

lizzy28
Quartz | Level 8

Thank you!

 

mean(of avar_all[*]) gives the mean of across-variable, a1-a4. However, I need to replace the missing with the mean of other values of the same variable (just like proc standard does).

 

For example, a3 value for id#3 should be replaced by the mean of a3, 2.82.

KachiM
Rhodochrosite | Level 12

I don't see how you are getting 2.82

Give the individual values as how you are deriving it.

ChrisNZ
Tourmaline | Level 20

Like this? [Edited as I wasn't computing a column average]

 

 

data HAVE;
input ID $3. (A1 A2 A3 A4 B1 B2 B3 B4 B5) (2.); 
cards;
1  1 5 4 1 1 3 2 3 4 
2  4 1 3 5 1 4 3 3 1 
3  2 3   3 3 3 3 2 3 
4  3 3 2 5 2 3 1 1 1 
5  3 3 2 4 4 4 1 3 2 
6  1   2 1   3 2 3 4 
7  3 2 1 4 2 2 4   4 
8  3 5 4 3 1 1 3 4 4 
9  1 1 2 5 4 2 4 1 3 
10 1 3 5 5 1 1 2 4 2 
11 3       1 4 2 4 4 
12 4 5 1 5 3 2 2 3 4 
13 1 3 4 2 2 2 1 3 1 
14 3 5 1 2 3   1 4 4 
15 5 2 5 3 1 1 1 1 2 
16         2 2 3 4 1 
17 4 4 5 1 2 1   1 4 
18 4 1 2 2 1 1 4 1 3 
19 5 5 1 2 2 1 1 3 3 
20 2 4 4 5 3 3 3 2 3 
run;
proc means data=HAVE(keep=A: B:) noprint; 
  output out=MEANS(drop=_:) mean=/autoname;
run;   
data WANT;
  set HAVE ;
  retain P 1;
  set MEANS point=P;
  array A [*] A1-A4 ;
  array B [*] B1-B5 ;
  keep ID A1-A4 B1-B5 ;
  if 0 < nmiss(of A[*]) < n(of A[*]) then do I=1 to dim(A);
    A[I]= coalesce( A[I], vvaluex(cats('A',I,'_MEAN' )));
  end;
  if 0 < nmiss(of B[*]) < n(of B[*]) then do I=1 to dim(B);
    B[I]= coalesce( B[I], vvaluex(cats('B',I,'_MEAN' )));
  end;
run;

 

Function whichn() cannot find a missing value, so we have to loop through the whole array to replace values sadly.

 

 

 

id a1 a2 a3 a4 b1 b2 b3 b4 b5
1 1 5 4 1 1 3 2 3 4
2 4 1 3 5 1 4 3 3 1
3 2 3 2.82 3 3 3 3 2 3
4 3 3 2 5 2 3 1 1 1
5 3 3 2 4 4 4 1 3 2
6 1 3.24 2 1 2.05 3 2 3 4
7 3 2 1 4 2 2 4 2.63 4
8 3 5 4 3 1 1 3 4 4
9 1 1 2 5 4 2 4 1 3
10 1 3 5 5 1 1 2 4 2
11 3 . . . 1 4 2 4 4
12 4 5 1 5 3 2 2 3 4
Reeza
Super User

@lizzy28 I'm curious whats the logic between looking at the number of missing row wise, but then replacing the value column wise.

lizzy28
Quartz | Level 8
Thanks, Reeza.
It is a survey data I'm working on. For each measurement, there are multiple items. For example, a1-a4 is to measure individuals' mental health, and b1-b5 is to measure physical health. The sum of a1-a4 is the final score for a person. Therefore, if one person only completed one item out of four, the person cannot be counted as a valid observation.
Reeza
Super User

Do multiple steps. 

First idenrify the variables to be standardized by getting the number of missing per column. Create a list of the variable names from that and pass it to proc standardize. Here's a sketch (untested). 

 

Proc means data =have n nmiss stackods;

var a1-a4 b1-b5;

ODS output summary=stats;

run;

 

data var_list;

set stats end=EEof;

length var_list $250.;

retain var_list;

 

if nmiss/n > 0.5 the do;

var_list = catx(" ", var_list, name);

end;

 

if eeof then call symputx('var_list', var_list);

 

run;

 

proc standardize...

var &var_list;

...

run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2901 views
  • 1 like
  • 4 in conversation