Hi All , I want to replace Missing values with average. As an example my data set contains variable salary Gender and age_group, i want to replace missing values of Salary with avg for That Gender(which is in Observation)and falls in that particular Age Group.
I want to replace the Missing Value @ Id 8 with Avg Salary of female in Medium Age Group.
Id | Age_group | Gender | Salary |
1 | Young | M | 44701 |
2 | Medium | F | 41755 |
3 | Old | M | 37127 |
4 | Young | F | 85233 |
5 | Medium | M | 67517 |
6 | Old | F | 99904 |
7 | Young | M | 26938 |
8 | Medium | F | |
9 | Old | M | 37583 |
10 | Young | F | 78853 |
11 | Medium | M | |
12 | Old | F | 73138 |
13 | Young | M | 29987 |
14 | Medium | F | 33115 |
15 | Old | M | 50883 |
16 | Young | F | |
17 | Medium | M | 48555 |
18 | Old | F | 62939 |
19 | Young | M | 52488 |
20 | Medium | F | 65333 |
21 | Old | M | 87914 |
do you mean this?
data have;
infile datalines truncover;
input Id Age_group : $15. Gender $ Salary;
cards;
1 Young M 44701
2 Medium F 41755
3 Old M 37127
4 Young F 85233
5 Medium M 67517
6 Old F 99904
7 Young M 26938
8 Medium F
9 Old M 37583
10 Young F 78853
11 Medium M
12 Old F 73138
13 Young M 29987
14 Medium F 33115
15 Old M 50883
16 Young F
17 Medium M 48555
18 Old F 62939
19 Young M 52488
20 Medium F 65333
21 Old M 87914
;
proc sql;
create table want as
select *,sum(salary,mean(salary)*(salary=.)) as s
from have
group by gender,Age_group
order by id;
quit;
do you mean this?
data have;
infile datalines truncover;
input Id Age_group : $15. Gender $ Salary;
cards;
1 Young M 44701
2 Medium F 41755
3 Old M 37127
4 Young F 85233
5 Medium M 67517
6 Old F 99904
7 Young M 26938
8 Medium F
9 Old M 37583
10 Young F 78853
11 Medium M
12 Old F 73138
13 Young M 29987
14 Medium F 33115
15 Old M 50883
16 Young F
17 Medium M 48555
18 Old F 62939
19 Young M 52488
20 Medium F 65333
21 Old M 87914
;
proc sql;
create table want as
select *,sum(salary,mean(salary)*(salary=.)) as s
from have
group by gender,Age_group
order by id;
quit;
The previous response shows you the missing and the average value to be imputed for those missings in the next column
The below one replaces the original salary column
data have;
infile datalines truncover;
input Id Age_group : $15. Gender $ Salary;
cards;
1 Young M 44701
2 Medium F 41755
3 Old M 37127
4 Young F 85233
5 Medium M 67517
6 Old F 99904
7 Young M 26938
8 Medium F
9 Old M 37583
10 Young F 78853
11 Medium M
12 Old F 73138
13 Young M 29987
14 Medium F 33115
15 Old M 50883
16 Young F
17 Medium M 48555
18 Old F 62939
19 Young M 52488
20 Medium F 65333
21 Old M 87914
;
proc sql;
create table want as
select id,Age_group,Gender,sum(salary,mean(salary)*(salary=.)) as salary
from have
group by gender,Age_group
order by id;
quit;
@Savyasachi007 I am about to step out for lunch with my mates. Give me 30 mins please, I will leave some notes.
@Savyasachi007 Sorry took a bit too long as I was enjoying every bite lol
Here are the notes:
1. There are three processes that is involved in this execution a. group by as mentioned by you b. finding the mean c. finding the missing and imputing the mean
2. Salary=. is binary expression that results in 1 or 0 or in other words True or false. If true which is when missing in our case, results in 1 and when not missing results in 0
3. Once mean is calculated, i multiply the mean with the result of binary expression to make sure I have 0's for not missing to be summed(added) with existing salary and computed mean salary value *1=computed mean salary value for missing salary
4. so this will result in final expression of
0+salaryvalue(for non missing cases)
computedmeansalaryvalue+. (for missing cases)
5. The sum function is applied instead of operator to ignore the missing values and the final result will have your imputed salary safely for your missing cases and original salary for non missing cases respectively
I hope that helps
Best Regards and have a nice day.
Thanks A ton,
Have a great day.
Replacing missing values by the mean is called "mean imputation," You can read about how to use PROC STDIZE to perform mean imputation---and why mean imputation is generally a bad idea, statistically speaking.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.