04-24-2018 11:11 AM - edited 04-24-2018 11:14 AM
The question I have is I am cleaning data for use with a predictive model and one of the tables I am looking at has to do with disease cases. I have gender, disease, year, age group, population as variables.
The problem I am running into is : for example I have osteoporosis disease with 20,000 cases for males and 20,000 cases for females. For the "all" gender there is missing value, how can I make this calculation?
04-24-2018 11:38 AM
Yes, there is only a missing value in all if there is a value for both male and female.
If there is a value missing for females then there is a value for male and all for that age group/disease category.
04-24-2018 11:41 AM
I think you meant to say:
Yes, there is only a missing value in all if there is a missing value for both male and female.
In this case, there is nothing that you can do to fix this. There is no way to calculate the All category without individual cases for either Female or Male.
04-24-2018 11:44 AM
For each missing "All" value, the male and female values are there.
For each missing "Female" values, the all and male values are there.
For each missing "Male" value, the all and female values are there.
04-24-2018 12:15 PM
I've recreated a small sample of your sample here:
data diseases; infile datalines dlm=' ' dsd; input year age $ gender $ disease :$50. cases pop; datalines; 1990 0-17 All Asthma 182000 64177000 1990 0-17 All "Dementia and Alzheimer's disease" . 64177000 1990 0-17 Female Asthma 71000 31295000 1990 0-17 Female "Dementia and Alzheimer's disease" 22000 31295000 1990 0-17 Male Asthma 111000 32883000 1990 0-17 Male "Dementia and Alzheimer's disease" . 32883000 run;
So if you do have a scenario as you mentioned, the easiest way to validate your data is to transpose it and then check that all = sum of females and males, as below. If it isn't, then make it so.
proc sort data=diseases; by year age disease; run; proc transpose data=diseases out=diseases_t; by year age disease; var cases pop; id gender; run; data diseases_validated; set diseases_t; valid = (all=sum(female,male)); run;