BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
joebacon
Pyrite | Level 9

Hi all,

 

It has been a little while. I come back to you with what I think should be a pretty simple request.

 

I have a dataset which i trimmed down to 5 variables. It looks like this:

 

data WORK.NEO;
  infile datalines dsd truncover;
  input PUBID:BEST. R1HDRK_Dummy:BEST. R2HDRK_Dummy:BEST. R3HDRK_Dummy:BEST. R4HDRK_Dummy:BEST. R5HDRK_Dummy:BEST.;
  format PUBID BEST. R1HDRK_Dummy BEST. R2HDRK_Dummy BEST. R3HDRK_Dummy BEST. R4HDRK_Dummy BEST. R5HDRK_Dummy BEST.;
  label PUBID="PUBID" R1HDRK_Dummy="R1HDRK_Dummy" R2HDRK_Dummy="R2HDRK_Dummy" R3HDRK_Dummy="R3HDRK_Dummy" R4HDRK_Dummy="R4HDRK_Dummy" R5HDRK_Dummy="R5HDRK_Dummy";
datalines;
2 0 0 3 0 0
7 0 0 0 0 5
10 0 0 0 0 0
21 1 0 0 0 0
22 1 0 0 0 0
23 0 0 . . 5
29 0 0 0 0 .
31 0 0 0 0 0
35 0 0 3 0 .
41 0 0 0 0 0
;;;;

I am trying to work on a survival model. As such, I ran a missing data report to find that out of the 1858 observations, 1494 have complete data. However, I would like to include some of the incomplete data as these variables are simple, dichotomous dummy-coded variables. Each variable is a different year (1-5). I would ideally like to keep any data that has an occurrence and then is missing. For instance, I would like to keep line 9 (PID 35) because they reported binge drinking in their 3rd year, but then had missing data. Eventually, I will replace the missing values after the first occurrence with 0's but right now I am focused on the deletion.

 

I do not want any occurrence that I cannot tell if it was their first year binge drinking. For example, line 6 (PID 23) reported a case of binge drinking in year 5, but I do not know if they also did in year 3 or 4.

 

I wanted to see if anyone had any opinion on if this is best practice with deleting missing values for a survival model as well as some suggestions on code.

 

I tried using the following code, but the number remaining, 1576 seems rather low. I also know that this code is not streamlined, but it was the best I could do.

 

data Neo_Clean;
set neo;
if missing(R1HDRK_Dummy) and missing(R2HDRK_Dummy) and missing(R3HDRK_Dummy) and missing(R4HDRK_Dummy) and missing(R5HDRK_Dummy)  then delete;
if missing(R1HDRK_Dummy) then delete;
if  R5HDRK_Dummy > 0 and missing(R1HDRK_Dummy) or missing(R2HDRK_Dummy) or missing(R3HDRK_Dummy) or missing(R4HDRK_Dummy) then delete;
if  R4HDRK_Dummy > 0 and missing(R1HDRK_Dummy) or missing(R2HDRK_Dummy) or missing(R3HDRK_Dummy) then delete;
if  R3HDRK_Dummy > 0 and missing(R1HDRK_Dummy) or missing(R2HDRK_Dummy) then delete;
if  R2HDRK_Dummy > 0 and missing(R1HDRK_Dummy)  then delete;
run;

Any help would be appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @joebacon 

 

Not sure whether my understanding is accurate. But is this what you are after by any chance?

 

data WORK.NEO;
  infile datalines  truncover;
  input PUBID:BEST. R1HDRK_Dummy:BEST. R2HDRK_Dummy:BEST. R3HDRK_Dummy:BEST. R4HDRK_Dummy:BEST. R5HDRK_Dummy:BEST.;
  format PUBID BEST. R1HDRK_Dummy BEST. R2HDRK_Dummy BEST. R3HDRK_Dummy BEST. R4HDRK_Dummy BEST. R5HDRK_Dummy BEST.;
  label PUBID="PUBID" R1HDRK_Dummy="R1HDRK_Dummy" R2HDRK_Dummy="R2HDRK_Dummy" R3HDRK_Dummy="R3HDRK_Dummy" R4HDRK_Dummy="R4HDRK_Dummy" R5HDRK_Dummy="R5HDRK_Dummy";
datalines;
2 0 0 3 0 0
7 0 0 0 0 5
10 0 0 0 0 0
21 1 0 0 0 0
22 1 0 0 0 0
23 0 0 . . 5
29 0 0 0 0 .
31 0 0 0 0 0
35 0 0 3 0 .
41 0 0 0 0 0
;;;;

data want;
set neo;
k=compress(cats(of R1HDRK_Dummy--R5HDRK_Dummy),'0');
if anydigit(k)=1 and index(k,'.') ;
drop k;
run;

View solution in original post

5 REPLIES 5
novinosrin
Tourmaline | Level 20

Hi @joebacon 

 

Not sure whether my understanding is accurate. But is this what you are after by any chance?

 

data WORK.NEO;
  infile datalines  truncover;
  input PUBID:BEST. R1HDRK_Dummy:BEST. R2HDRK_Dummy:BEST. R3HDRK_Dummy:BEST. R4HDRK_Dummy:BEST. R5HDRK_Dummy:BEST.;
  format PUBID BEST. R1HDRK_Dummy BEST. R2HDRK_Dummy BEST. R3HDRK_Dummy BEST. R4HDRK_Dummy BEST. R5HDRK_Dummy BEST.;
  label PUBID="PUBID" R1HDRK_Dummy="R1HDRK_Dummy" R2HDRK_Dummy="R2HDRK_Dummy" R3HDRK_Dummy="R3HDRK_Dummy" R4HDRK_Dummy="R4HDRK_Dummy" R5HDRK_Dummy="R5HDRK_Dummy";
datalines;
2 0 0 3 0 0
7 0 0 0 0 5
10 0 0 0 0 0
21 1 0 0 0 0
22 1 0 0 0 0
23 0 0 . . 5
29 0 0 0 0 .
31 0 0 0 0 0
35 0 0 3 0 .
41 0 0 0 0 0
;;;;

data want;
set neo;
k=compress(cats(of R1HDRK_Dummy--R5HDRK_Dummy),'0');
if anydigit(k)=1 and index(k,'.') ;
drop k;
run;
joebacon
Pyrite | Level 9
It is very close! This gets me all those who had an occurrence and then had missing data. I do want to keep these individuals in addition to the other 1494 with complete data.

Like always, @novinosrin, great use of the compress(cats()). That's very handy and a clever way to deal with this.
novinosrin
Tourmaline | Level 20

" I do want to keep these individuals in addition to the other 1494 with complete data."

 

I am not understanding. Do you mean you just want to Flag?

joebacon
Pyrite | Level 9
I can work with what you gave.

It outputs the 77 individuals who had an occurrence and then went missing. I also want to include in that dataset the 1494 that have no missing data. I will just merge the two datasets 🙂
novinosrin
Tourmaline | Level 20

Ok cool 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1089 views
  • 1 like
  • 2 in conversation