Solved: Delete missing cases where one variable has more than 50% missing

joebacon · Posted 11-16-2018 10:45 AM

Hi all! I hope your Fridays are going well.

As the title suggests, I had a question about missing cases and how to delete cases where one variable is missing more than 50% of the cases.

I have a set of longitudinal data and want to do multiple Imputation (MI) on the data to account for some of the missing data. However, since the data is longitudinal and revolves around weight (fluctuating seemingly randomly), I wanted to delete said cases. A sample of my data is below:

data WORK.BMI_DOB;
  infile datalines dsd truncover;
  input ID:32. Gender:32. DOB_M:32. DOB_Y:32. Race:32. Year:32. Height:32. Weight:32. BMI:6.3 DOB:MONYY7. Age:32.;
  format BMI 6.3 DOB MONYY7.;

label ID="PUBID - YTH ID CODE 1997" Gender="1= Male, 2=Female" DOB_M="Date of Birth Month" DOB_Y="Date of Birth Year" Race="1= Black, 2= Hispanic, 3= Mixed Race (Non-Hispanic) 4= Non-Black / Non-Hispanic" Year="Year the Survey took place" Height="Height in
 Inches" Weight="Weight in pounds" BMI="Body Mass Index";
datalines;
1 2 9 1981 4 1997 67 145 22.708 SEP1981 16
1 2 9 1981 4 1998 67 150 23.491 SEP1981 17
1 2 9 1981 4 1999 67 150 23.491 SEP1981 18
1 2 9 1981 4 2000 68 160 24.325 SEP1981 19
1 2 9 1981 4 2001 67 163 25.527 SEP1981 20
1 2 9 1981 4 2002 67 155 24.274 SEP1981 21
1 2 9 1981 4 2003 67 153 23.961 SEP1981 22
1 2 9 1981 4 2004 67 160 25.057 SEP1981 23
1 2 9 1981 4 2005 67 160 25.057 SEP1981 24
1 2 9 1981 4 2006 67 157 24.587 SEP1981 25
1 2 9 1981 4 2007 67 156 24.430 SEP1981 26
1 2 9 1981 4 2008 67 160 25.057 SEP1981 27
1 2 9 1981 4 2009 67 158 24.744 SEP1981 28
1 2 9 1981 4 2010 67 152 23.804 SEP1981 29
1 2 9 1981 4 2011 67 155 24.274 SEP1981 30
1 2 9 1981 4 2013 67 152 23.804 SEP1981 32
1 2 9 1981 4 2015 67 175 27.406 SEP1981 34
2 1 7 1982 2 1997 67 135 21.142 JUL1982 15
2 1 7 1982 2 1998 69 150 22.149 JUL1982 16
2 1 7 1982 2 1999 67 140 21.925 JUL1982 17
2 1 7 1982 2 2000 66 150 24.208 JUL1982 18
2 1 7 1982 2 2001 67 158 24.744 JUL1982 19
2 1 7 1982 2 2002 67 163 25.527 JUL1982 20
2 1 7 1982 2 2003 67 170 26.623 JUL1982 21
2 1 7 1982 2 2004 67 180 28.189 JUL1982 22
2 1 7 1982 2 2005 65 175 29.118 JUL1982 23
2 1 7 1982 2 2006 . . . JUL1982 24
2 1 7 1982 2 2007 . . . JUL1982 25
2 1 7 1982 2 2008 67 175 27.406 JUL1982 26
2 1 7 1982 2 2009 67 181 28.346 JUL1982 27
2 1 7 1982 2 2010 67 189 29.598 JUL1982 28
2 1 7 1982 2 2011 67 175 27.406 JUL1982 29
2 1 7 1982 2 2013 67 180 28.189 JUL1982 31
2 1 7 1982 2 2015 67 225 35.236 JUL1982 33
;;;;

After reading a few other threads and some prior knowledge, I think you can do this with a SQL command. I would like to think I am close by first trying to make a table of only the missing values like such:

proc sql;
 
   create table Missing_Gone as
   select * from BMI_DOB
   where ID not in 
   (select ID from BMI_DOB where cmiss(Weight) > 8);
quit;

This code is still giving me EVERY ID. I am guessing this is because it is saying that in the "weight" variable there are more than 8 cases missing. However, I want it 8 cases for each ID.

As stated, I want to keep the delete all the cases of that ID where they are missing more than 8 cases of Weight as there are 17 total check-ins from 1997-2015. I hope i was clear. Let me know if you need further explanation.

Any guidance would be appreciated.

Thank you in advance.

FreelanceReinh · Posted 11-16-2018 12:09 PM

Hi @joebacon,

Try this:

proc sql;
create table Missing_Gone as
select * from BMI_DOB
group by ID
having nmiss(weight)/count(*)<=0.5;
quit;

This selects all observations of only those IDs for which the ratio "number of missing values of WEIGHT divided by number of observations" is less than or equal to 50%.

Edit:

The reason why you have to use NMISS here and not CMISS is that NMISS is an SQL aggregate function (i.e. one that works vertically, across observations) if called with only one argument, whereas CMISS can only work horizontally, i.e. within the same observation (and is therefore often called with more than one argument).

The SQL aggregate function NMISS (i.e. with one argument) accepts either a character or a numeric argument. If used with two or more arguments, NMISS works like in the DATA step (horizontally) and accepts only numeric arguments.* CMISS accepts always both types, even within the same call ("C" stands for "count").

*[Edit 2:] Unlike the NMISS function in the DATA step it does not attempt to convert character arguments to numeric values.

View solution in original post

joebacon · Posted 11-16-2018 10:56 AM

I do realize that it should be this code:

proc sql;
 
   create table Missing_Gone as
   select * from BMI_DOB
   where ID in 
   (select ID from BMI_DOB where cmiss(Weight) > 8);
quit;

Which has "in" instead of "not in". However, this code produces no observations.

Reeza · Posted 11-16-2018 11:13 AM

Calculate missing across each ID first, I don't think CMISS will do that, it's character miss and you have a numeric variable. Try NMISS instead.

joebacon · Posted 11-16-2018 11:21 AM

Are you saying its a whole different code first to calculate missing across ID where I make a new variable and then use said variable?

I tried simply changing to nmiss (good catch) as such:

proc sql;
 
   create table Missing_Gone as
   select * from BMI_DOB
   where ID in 
   (select ID from BMI_DOB where nmiss(Weight) > 8);
quit;

However, I was getting the error:

342        
343        
344        proc sql;
345        
346           create table Missing_Gone as
347           select * from BMI_DOB
348           where ID in
349           (select ID from BMI_DOB where nmiss(Weight) > 8);
ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
350        quit;

r_behata · Posted 11-16-2018 12:08 PM

Is this what you are looking for ?

data WORK.BMI_DOB;
  infile datalines  truncover;
  input ID:32. Gender:32. DOB_M:32. DOB_Y:32. Race:32. Year:32. Height:32. Weight:32. BMI:6.3 DOB:MONYY7. Age:32.;
  format BMI 6.3 DOB MONYY7.;

label ID="PUBID - YTH ID CODE 1997" Gender="1= Male, 2=Female" DOB_M="Date of Birth Month" DOB_Y="Date of Birth Year" Race="1= Black, 2= Hispanic, 3= Mixed Race (Non-Hispanic) 4= Non-Black / Non-Hispanic" Year="Year the Survey took place" Height="Height in
 Inches" Weight="Weight in pounds" BMI="Body Mass Index";
datalines;
1 2 9 1981 4 1997 67 145 22.708 SEP1981 16
1 2 9 1981 4 1998 67 150 23.491 SEP1981 17
1 2 9 1981 4 1999 67 150 23.491 SEP1981 18
1 2 9 1981 4 2000 68 160 24.325 SEP1981 19
1 2 9 1981 4 2001 67 163 25.527 SEP1981 20
1 2 9 1981 4 2002 67 155 24.274 SEP1981 21
1 2 9 1981 4 2003 67 153 23.961 SEP1981 22
1 2 9 1981 4 2004 67 160 25.057 SEP1981 23
1 2 9 1981 4 2005 67 160 25.057 SEP1981 24
1 2 9 1981 4 2006 67 157 24.587 SEP1981 25
1 2 9 1981 4 2007 67 156 24.430 SEP1981 26
1 2 9 1981 4 2008 67 160 25.057 SEP1981 27
1 2 9 1981 4 2009 67 158 24.744 SEP1981 28
1 2 9 1981 4 2010 67 152 23.804 SEP1981 29
1 2 9 1981 4 2011 67 155 24.274 SEP1981 30
1 2 9 1981 4 2013 67 152 23.804 SEP1981 32
1 2 9 1981 4 2015 67 175 27.406 SEP1981 34
2 1 7 1982 2 1997 67 135 21.142 JUL1982 15
2 1 7 1982 2 1998 69 150 22.149 JUL1982 16
2 1 7 1982 2 1999 67 140 21.925 JUL1982 17
2 1 7 1982 2 2000 66 150 24.208 JUL1982 18
2 1 7 1982 2 2001 67 158 24.744 JUL1982 19
2 1 7 1982 2 2002 67 163 25.527 JUL1982 20
2 1 7 1982 2 2003 67 170 26.623 JUL1982 21
2 1 7 1982 2 2004 67 180 28.189 JUL1982 22
2 1 7 1982 2 2005 65 175 29.118 JUL1982 23
2 1 7 1982 2 2006 . . . JUL1982 24
2 1 7 1982 2 2007 . . . JUL1982 25
2 1 7 1982 2 2008 67 . 27.406 JUL1982 26
2 1 7 1982 2 2009 67 . 28.346 JUL1982 27
2 1 7 1982 2 2010 67 . 29.598 JUL1982 28
2 1 7 1982 2 2011 67 . 27.406 JUL1982 29
2 1 7 1982 2 2013 67 . 28.189 JUL1982 31
2 1 7 1982 2 2015 67 . 35.236 JUL1982 33
run;

data want;

miss_=0;
do until(last.id);
	set BMI_DOB;
	by id;

	if missing(Weight) then miss_+1;
end;

do until(last.id);
	set BMI_DOB;
	by id;

	if miss_ < 8; 
	output;
end;

drop miss_;
run;

FreelanceReinh · Posted 11-16-2018 12:09 PM

Hi @joebacon,

Try this:

proc sql;
create table Missing_Gone as
select * from BMI_DOB
group by ID
having nmiss(weight)/count(*)<=0.5;
quit;

This selects all observations of only those IDs for which the ratio "number of missing values of WEIGHT divided by number of observations" is less than or equal to 50%.

Edit:

The reason why you have to use NMISS here and not CMISS is that NMISS is an SQL aggregate function (i.e. one that works vertically, across observations) if called with only one argument, whereas CMISS can only work horizontally, i.e. within the same observation (and is therefore often called with more than one argument).

The SQL aggregate function NMISS (i.e. with one argument) accepts either a character or a numeric argument. If used with two or more arguments, NMISS works like in the DATA step (horizontally) and accepts only numeric arguments.* CMISS accepts always both types, even within the same call ("C" stands for "count").

*[Edit 2:] Unlike the NMISS function in the DATA step it does not attempt to convert character arguments to numeric values.

joebacon · Posted 11-16-2018 01:16 PM

@FreelanceReinhard

Not only did you help me with my code, but you explained to me why your change fixed it. Many thanks!

Always like learning new things. This worked exactly as intended.

Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Re: Delete missing cases where one variable has more than 50% missing

Registration is open

SAS Training: Just a Click Away