Calcite | Level 5

Fixing Missing Data

Hello everyone,

The question I have is I am cleaning data for use with a predictive model and one of the tables I am looking at has to do with disease cases. I have gender, disease, year, age group, population as variables.

The problem I am running into is : for example I have osteoporosis disease with 20,000 cases for males and 20,000 cases for females. For the "all" gender there is missing value, how can I make this calculation?

6 REPLIES 6
SAS Employee

Re: Fixing Missing Data

From the sample you showed, the only time that the All category has missing cases is when there are missing cases for both Females and Males. Is this always the case?

Calcite | Level 5

Re: Fixing Missing Data

Yes, there is only a missing value in all if there is a value for both male and female.

If there is a value missing for females then there is a value for male and all for that age group/disease category.

SAS Employee

Re: Fixing Missing Data

I think you meant to say:

Yes, there is only a missing value in all if there is a missing value for both male and female.

In this case, there is nothing that you can do to fix this. There is no way to calculate the All category without individual cases for either Female or Male.

Calcite | Level 5

Re: Fixing Missing Data

For each missing "All" value, the male and female values are there.

For each missing "Female" values, the all and male values are there.

For each missing "Male" value, the all and female values are there.

SAS Employee

Re: Fixing Missing Data

This doesn't seem to be reflected in your sample. Is your data different?

SAS Employee

Re: Fixing Missing Data

I've recreated a small sample of your sample here:

``````data diseases;
infile datalines dlm=' ' dsd;
input year age \$ gender \$ disease :\$50. cases pop;
datalines;
1990 0-17 All Asthma 182000 64177000
1990 0-17 All "Dementia and Alzheimer's disease" . 64177000
1990 0-17 Female Asthma 71000 31295000
1990 0-17 Female "Dementia and Alzheimer's disease" 22000 31295000
1990 0-17 Male Asthma 111000 32883000
1990 0-17 Male "Dementia and Alzheimer's disease" . 32883000
run;``````

So if you do have a scenario as you mentioned, the easiest way to validate your data is to transpose it and then check that all = sum of females and males, as below. If it isn't, then make it so.

``````proc sort data=diseases;
by year age disease;
run;

proc transpose data=diseases out=diseases_t;
by year age disease;
var cases pop;
id gender;
run;

data diseases_validated;
set diseases_t;
valid = (all=sum(female,male));
run;``````

Discussion stats
• 6 replies
• 900 views
• 0 likes
• 2 in conversation