Solved: Re: Last observation carried to previous missing and the missing after

Cruise · Posted 10-05-2018 09:27 AM

I have a repeated measurement with missing data and trying to use the observation available carried backward ad forward to substitute the missing before and after the observation.

What would the solution here?

Thanks in advance!

data have;
input pat_id	hemoglobin;
cards;
1	.
1	123
1	.
1	124
1	125
1	121
1	.
1	126
2	120
2	220
2	.
2	.
2	.
2	201
2	.
2	.
3   .
3   .
3   .
3   .
3   150
; 

data want;
input pat_id hemoglobin;
cards;
1	123
1	123
1	123
1	124
1	125
1	121
1	121
1	126
2	120
2	220
2	220
2	220
2	220
2	201
2	201
2	201
3   150
3   150
3   150
3   150
3   150
run;

Kurt_Bremser · Posted 10-05-2018 09:34 AM

Carry over once, sort in reverse order, carry over a second time, re-sort to original order:

data have;
input pat_id hemoglobin;
cards;
1 .
1 123
1 .
1 124
1 125
1 121
1 .
1 126
2 120
2 220
2 .
2 .
2 .
2 201
2 .
2 .
3 .
3 .
3 .
3 .
3 150
;
run;

data int1;
set have;
by pat_id;
retain _hem;
if first.pat_id then _hem = hemoglobin;
if hemoglobin = .
then hemoglobin = _hem;
else _hem = hemoglobin;
order = _n_;
drop _hem;
run;

proc sort data=int1;
by pat_id descending order;
run;

data int2;
set int1;
by pat_id;
retain _hem;
if first.pat_id then _hem = hemoglobin;
if hemoglobin = .
then hemoglobin = _hem;
else _hem = hemoglobin;
drop _hem;
run;

proc sort
  data=int2
  out=want (drop=order)
;
by order;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 10-05-2018 09:34 AM

Carry over once, sort in reverse order, carry over a second time, re-sort to original order:

data have;
input pat_id hemoglobin;
cards;
1 .
1 123
1 .
1 124
1 125
1 121
1 .
1 126
2 120
2 220
2 .
2 .
2 .
2 201
2 .
2 .
3 .
3 .
3 .
3 .
3 150
;
run;

data int1;
set have;
by pat_id;
retain _hem;
if first.pat_id then _hem = hemoglobin;
if hemoglobin = .
then hemoglobin = _hem;
else _hem = hemoglobin;
order = _n_;
drop _hem;
run;

proc sort data=int1;
by pat_id descending order;
run;

data int2;
set int1;
by pat_id;
retain _hem;
if first.pat_id then _hem = hemoglobin;
if hemoglobin = .
then hemoglobin = _hem;
else _hem = hemoglobin;
drop _hem;
run;

proc sort
  data=int2
  out=want (drop=order)
;
by order;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 10-05-2018 09:51 AM

I appreciate it!

RW9 · Posted 10-05-2018 09:40 AM

What does this data represent? If just shows some value and an id. Its highly likely that you real data has more information than this, such as timepoints/datetimes or visits etc. which would make identifying where values should be imputed a lot more robust and simpler. From what you post here, just moving the value before and after doesn't really seem the ideal method, I mean:

1	121
1	.
1	126

Should 121 be carried forward, or 126 be imputed backwards?

Cruise · Posted 10-05-2018 09:43 AM

Yeah, that's such a conceptual argument. I have to think of the underlying biological plausibility. But the extent of analysis I'm doing right now is not taking full advantage of the data yet. But definitely an important aspect to think about when I actually start looking at it for modelling purposes.

Cruise · Posted 10-05-2018 09:51 AM

I appreciate it

data_null__ · Posted 10-05-2018 10:58 AM

EDIT:

The method below will not work correctly if ID has all missing LOCF variables. This will work.

data have;
   input pat_id hem @@;
   ohem = hem; *for check;
   cards;
1	.   1	123 1	. 1	124 1	125 1	121 1	. 1	126 
2	120 2	220 2	. 2	. 2	. 2	201 2	. 2	. 
3   . 3   . 3   . 3   . 3   150 
4 . 4 .
;;;;
   run; 
proc print;
   run;

data want;
   flag = 0;
   do until(last.pat_id); 
      set have(in=in1);
      by pat_id;
      if flag then continue;
      if not missing(hem) then do; flag=1; back=hem; end;
      end;
   do until(last.pat_id);
      update have(obs=0 drop=ohem) have(drop=ohem);
      by pat_id;
      hem = coalesce(hem,back);
      set have(keep=ohem);
      output;
      end;
   run;
proc print;
   run;

The uses the update trick for the LOCF and accommodates the "look back". Ideally you would like a method that works to LOCF/BACK many variables with having to reference them in the code too often.

data have;
   input pat_id hem @@;
   ohem = hem; *for check;
   cards;
1	.   1	123 1	. 1	124 1	125 1	121 1	. 1	126 
2	120 2	220 2	. 2	. 2	. 2	201 2	. 2	. 
3   . 3   . 3   . 3   . 3   150 
;;;;
   run; 
proc print;
   run;
data backv / view=backv;
   set have(where=(not missing(hem)));
   by pat_id;
   if first.pat_id;
   rename hem=back;
   keep pat_id hem;
   run;
data want;
   if 0 then set have;
   do until(last.pat_id); 
      set backv;
      by pat_id;
      end;
   do until(last.pat_id);
      update have(obs=0 drop=ohem) have(drop=ohem);
      by pat_id;
      hem = coalesce(hem,back);
      set have(keep=ohem);
      output;
      end;
   drop back;
   run;
proc print;
   run;

Cruise · Posted 10-05-2018 11:30 AM

Excellent. Was it more time efficient that forward-then-reverse order approach than Bremser's? Just curious to understand the cons-pros.

data_null__ · Posted 10-05-2018 12:32 PM

@Cruise wrote:
Excellent. Was it more time efficient that forward-then-reverse order approach than Bremser's? Just curious to understand the cons-pros.

If your data are sorted BY id as in your sample data then no sorting is needed.

One advantage of the update trick is you can LOCF many variables. In fact the default would be to LOCF all variables other than ID.

I'm trying to figure a easy way to create a single observation of the FIRST non-missing values for each ID that is simple. Then the look back part could be easily extended to all non-id variables.

novinosrin · Posted 10-05-2018 12:05 PM

When Guru JKing has participated, mine is unimportant but allow me to have fun

data have;
input pat_id	hemoglobin;
cards;
1	.
1	123
1	.
1	124
1	125
1	121
1	.
1	126
2	120
2	220
2	.
2	.
2	.
2	201
2	.
2	.
3   .
3   .
3   .
3   .
3   150
; 

data want;
if _n_=1 then do;
dcl hash h(multidata:'y');
h.definekey('pat_id');
h.definedata('hemoglobin');
h.definedone();
end;
do  until(last.pat_id);
set have;
by pat_id;
if hemoglobin ne . then if h.check() ne 0 then rc=h.add();
end;
call missing(_h);
do until(last.pat_id);
set have;
by pat_id;
if not missing(hemoglobin) then _h=hemoglobin;
else if missing(hemoglobin) and _h then hemoglobin=_h;
else if missing(hemoglobin) then rc=h.find();
output;
end;
drop rc _h;
run;

novinosrin · Posted 10-05-2018 12:31 PM

Or a direct dump into hash, however i was bit concerned about the insertion order earlier and played safe with Double DOW 1. to insert 2. look up

data have;
input pat_id	hemoglobin;
cards;
1	.
1	123
1	.
1	124
1	125
1	121
1	.
1	126
2	120
2	220
2	.
2	.
2	.
2	201
2	.
2	.
3   .
3   .
3   .
3   .
3   150
; 
data want;
if _n_=1 then do;
if 0 then set have;
dcl hash h(dataset:'have(where=(hemoglobin ne .))');
h.definekey('pat_id');
h.definedata('hemoglobin');
h.definedone();
end;
call missing(_h);
do until(last.pat_id);
set have;
by pat_id;
if not missing(hemoglobin) then _h=hemoglobin;
else if missing(hemoglobin) and _h then hemoglobin=_h;
else if missing(hemoglobin) then rc=h.find();
output;
end;
drop rc _h;
run;

novinosrin · Posted 10-05-2018 12:58 PM

Or with update and hash


data have;
input pat_id	hemoglobin;
cards;
1	.
1	123
1	.
1	124
1	125
1	121
1	.
1	126
2	120
2	220
2	.
2	.
2	.
2	201
2	.
2	.
3   .
3   .
3   .
3   .
3   150
; 

data want;
if _n_=1 then do;
if 0 then set have;
dcl hash h(dataset:'have(where=(hemoglobin ne .))');
h.definekey('pat_id');
h.definedata('hemoglobin');
h.definedone();
end;
do until(last.pat_id);
update have(obs=0) have;
by pat_id;
if missing(hemoglobin) then rc=h.find();
output;
end;
drop rc ;
run;

Cruise · Posted 10-05-2018 01:39 PM

@novinosrin

Thanks. This is nice.

By the way, which doens't belong to this thread though is that

I use below code of yours like a built-in function in sas in every program I use. So handy. However, I could not modify your code to deduplicate the repeated values for each individual. So I use proc nodukpey on the resulting dataset.

proc sql;
create table have1(drop=r) as /*N=6,588,823 and unique kids N=2,862,721*/
select *,missing(cat_var) as flag_missing,range(calculated flag_missing) as r
from have /*N=9,876,767*/
group by child_id
having ((r=0)*visit)=min(visit) or (cat_var ne ' ')
order by child_id,visit;
quit;

proc sort data=have1 nodupkey out=have2;

by child_id cat_var;

run;

novinosrin · Posted 10-05-2018 01:55 PM

Thank you @Cruise for your acknowledgement. Answering your questions has always made me learn more and this thread being touch special as It's always pleasure and privilege to read/be part of thread when @data_null__ has participated. I shamelessly plagiarized a lot of DN's code from SAS L and continue to learn just by reading the master's posts. I am sure DN is so used to having many followers like me. My dream is to become as good as the geniuses DN, PD, PG stats. (unrealistic eh lol)

With reference to your other post, I am glad you were able to modify on your own however should you need help, never hesitate. Besides our professional and personal life, we are all a SAS family here where we can freely ask, find and share. 🙂 Take care

Registration is open

SAS Training: Just a Click Away