Solved: Deleting Observations based on a pattern

joebacon · Posted 04-24-2019 03:13 PM

Hi all,

It has been a little while. I come back to you with what I think should be a pretty simple request.

I have a dataset which i trimmed down to 5 variables. It looks like this:

data WORK.NEO;
  infile datalines dsd truncover;
  input PUBID:BEST. R1HDRK_Dummy:BEST. R2HDRK_Dummy:BEST. R3HDRK_Dummy:BEST. R4HDRK_Dummy:BEST. R5HDRK_Dummy:BEST.;
  format PUBID BEST. R1HDRK_Dummy BEST. R2HDRK_Dummy BEST. R3HDRK_Dummy BEST. R4HDRK_Dummy BEST. R5HDRK_Dummy BEST.;
  label PUBID="PUBID" R1HDRK_Dummy="R1HDRK_Dummy" R2HDRK_Dummy="R2HDRK_Dummy" R3HDRK_Dummy="R3HDRK_Dummy" R4HDRK_Dummy="R4HDRK_Dummy" R5HDRK_Dummy="R5HDRK_Dummy";
datalines;
2 0 0 3 0 0
7 0 0 0 0 5
10 0 0 0 0 0
21 1 0 0 0 0
22 1 0 0 0 0
23 0 0 . . 5
29 0 0 0 0 .
31 0 0 0 0 0
35 0 0 3 0 .
41 0 0 0 0 0
;;;;

I am trying to work on a survival model. As such, I ran a missing data report to find that out of the 1858 observations, 1494 have complete data. However, I would like to include some of the incomplete data as these variables are simple, dichotomous dummy-coded variables. Each variable is a different year (1-5). I would ideally like to keep any data that has an occurrence and then is missing. For instance, I would like to keep line 9 (PID 35) because they reported binge drinking in their 3rd year, but then had missing data. Eventually, I will replace the missing values after the first occurrence with 0's but right now I am focused on the deletion.

I do not want any occurrence that I cannot tell if it was their first year binge drinking. For example, line 6 (PID 23) reported a case of binge drinking in year 5, but I do not know if they also did in year 3 or 4.

I wanted to see if anyone had any opinion on if this is best practice with deleting missing values for a survival model as well as some suggestions on code.

I tried using the following code, but the number remaining, 1576 seems rather low. I also know that this code is not streamlined, but it was the best I could do.

data Neo_Clean;
set neo;
if missing(R1HDRK_Dummy) and missing(R2HDRK_Dummy) and missing(R3HDRK_Dummy) and missing(R4HDRK_Dummy) and missing(R5HDRK_Dummy)  then delete;
if missing(R1HDRK_Dummy) then delete;
if  R5HDRK_Dummy > 0 and missing(R1HDRK_Dummy) or missing(R2HDRK_Dummy) or missing(R3HDRK_Dummy) or missing(R4HDRK_Dummy) then delete;
if  R4HDRK_Dummy > 0 and missing(R1HDRK_Dummy) or missing(R2HDRK_Dummy) or missing(R3HDRK_Dummy) then delete;
if  R3HDRK_Dummy > 0 and missing(R1HDRK_Dummy) or missing(R2HDRK_Dummy) then delete;
if  R2HDRK_Dummy > 0 and missing(R1HDRK_Dummy)  then delete;
run;

Any help would be appreciated.

novinosrin · Posted 04-24-2019 03:34 PM

Hi @joebacon

Not sure whether my understanding is accurate. But is this what you are after by any chance?

data WORK.NEO;
  infile datalines  truncover;
  input PUBID:BEST. R1HDRK_Dummy:BEST. R2HDRK_Dummy:BEST. R3HDRK_Dummy:BEST. R4HDRK_Dummy:BEST. R5HDRK_Dummy:BEST.;
  format PUBID BEST. R1HDRK_Dummy BEST. R2HDRK_Dummy BEST. R3HDRK_Dummy BEST. R4HDRK_Dummy BEST. R5HDRK_Dummy BEST.;
  label PUBID="PUBID" R1HDRK_Dummy="R1HDRK_Dummy" R2HDRK_Dummy="R2HDRK_Dummy" R3HDRK_Dummy="R3HDRK_Dummy" R4HDRK_Dummy="R4HDRK_Dummy" R5HDRK_Dummy="R5HDRK_Dummy";
datalines;
2 0 0 3 0 0
7 0 0 0 0 5
10 0 0 0 0 0
21 1 0 0 0 0
22 1 0 0 0 0
23 0 0 . . 5
29 0 0 0 0 .
31 0 0 0 0 0
35 0 0 3 0 .
41 0 0 0 0 0
;;;;

data want;
set neo;
k=compress(cats(of R1HDRK_Dummy--R5HDRK_Dummy),'0');
if anydigit(k)=1 and index(k,'.') ;
drop k;
run;

View solution in original post

novinosrin · Posted 04-24-2019 03:34 PM

Hi @joebacon

Not sure whether my understanding is accurate. But is this what you are after by any chance?

data WORK.NEO;
  infile datalines  truncover;
  input PUBID:BEST. R1HDRK_Dummy:BEST. R2HDRK_Dummy:BEST. R3HDRK_Dummy:BEST. R4HDRK_Dummy:BEST. R5HDRK_Dummy:BEST.;
  format PUBID BEST. R1HDRK_Dummy BEST. R2HDRK_Dummy BEST. R3HDRK_Dummy BEST. R4HDRK_Dummy BEST. R5HDRK_Dummy BEST.;
  label PUBID="PUBID" R1HDRK_Dummy="R1HDRK_Dummy" R2HDRK_Dummy="R2HDRK_Dummy" R3HDRK_Dummy="R3HDRK_Dummy" R4HDRK_Dummy="R4HDRK_Dummy" R5HDRK_Dummy="R5HDRK_Dummy";
datalines;
2 0 0 3 0 0
7 0 0 0 0 5
10 0 0 0 0 0
21 1 0 0 0 0
22 1 0 0 0 0
23 0 0 . . 5
29 0 0 0 0 .
31 0 0 0 0 0
35 0 0 3 0 .
41 0 0 0 0 0
;;;;

data want;
set neo;
k=compress(cats(of R1HDRK_Dummy--R5HDRK_Dummy),'0');
if anydigit(k)=1 and index(k,'.') ;
drop k;
run;

joebacon · Posted 04-24-2019 03:55 PM

It is very close! This gets me all those who had an occurrence and then had missing data. I do want to keep these individuals in addition to the other 1494 with complete data.

Like always, @novinosrin, great use of the compress(cats()). That's very handy and a clever way to deal with this.

novinosrin · Posted 04-24-2019 03:57 PM

" I do want to keep these individuals in addition to the other 1494 with complete data."

I am not understanding. Do you mean you just want to Flag?

joebacon · Posted 04-24-2019 04:08 PM

I can work with what you gave.

It outputs the 77 individuals who had an occurrence and then went missing. I also want to include in that dataset the 1494 that have no missing data. I will just merge the two datasets 🙂

novinosrin · Posted 04-24-2019 04:09 PM

Ok cool

Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Re: Deleting Observations based on a pattern

Registration is open