Solved: Re: Replacing missing values by previous observation

Statsconsultancy · Posted 05-24-2010 08:45 AM

I have a data where patients were prescribed medication. I have four columns labelled ID (for a patient), prescibeddate (date of prescription), drug ( name of the treatment) and quantity ( number of pills). In some patients I have the quantity missing. What I will like to do is to replace the missing quantity with the immediate previous quantity if the previous drug is the same in that patient. OR replace the missing quantity with the next if the next drug is the same. Otherwise the missing quantity is replaced by the standard quantity which is 28 for both treatment A and treatment B. I am working on a large data set. But as an example the original data is of the following format;

ID prescribeddate drug quantity
1 08/jan/2008 A 28
1 08/feb/2008 A 14
1 08/Mar/2008 A
1 08/Apr/2008 B 56
1 08/May/2008 B
2 08/Jan/2008 A 30
2 08/Mar/2009 B
2 28/Mar/2008 A 30
2 08/Mar/2008 B 14
3 08/Jan/2008 B
3 08/Mar/2009 B 56
3 08/Mar/2008 A
4 08/jan/2008 A
4 08/feb/2008 A
4 08/Mar/2008 A 14
4 08/Apr/2008 B
4 08/May/2008 B

I want the new data to look as follows:

ID prescribeddate drug quantity
1 08/jan/2008 A 28
1 08/feb/2008 A 14
1 08/Mar/2008 A 14
1 08/Apr/2008 B 56
1 08/May/2008 B 56
2 08/Jan/2008 A 30
2 08/Mar/2009 B 28
2 28/Mar/2008 A 30
2 08/Mar/2008 B 14
3 08/Jan/2008 B 56
3 08/Mar/2009 B 56
3 08/Mar/2008 A 28
4 08/jan/2008 A 14
4 08/feb/2008 A 14
4 08/Mar/2008 A 14
4 08/Apr/2008 B 28
4 08/May/2008 B 28

I believe i have made myself clear. Can anyone help. Thank you.

mkeintz · Posted 10-06-2017 08:14 PM

Editor's note: this is a popular topic. We've edited the solution to be more helpful for those facing a similar challenge. The first part here addresses the original question, and the second part addresses a more general LOCF (last observation carried forward) approach.

You apparently will go back one observation, but will go forward until a non-missing quantity is encountered, which was not how I understood your description. In other words, if the preceding observation is missing, then your task is a Next Observation Carried Back process.

I think the best way to approach this is to read an ID/DRUG group once and store quantity values into a array (_Q below). Then reread the group and when a missing quantity is encountered, read back from the array:

data want;

  array _q {0:10} _temporary_;
  do N=1 by 1 until (last.drug); 
    set have;
    by id drug notsorted;
    _q{N}=quantity;
  end;

  do I=1 to N;
    set have;
    do J= I-1 to N while (quantity=.);
      quantity=_q{J};
    end;
    if quantity=. then quantity=28;
    output;
  end;
  drop I J N;
run;

Notes that the lower bound of the _Q array should be set to 0, so that "do J=I-1 to ..." works when I=1.
Make sure the upper bound of array _Q will accommodate the longest series for single ID/DRUG group.

For the LOCF approach, @sarathannapareddy suggests the following.

Using 2 SET Statements in a single datastep:

Here is the code:

data l4;
 set l2;
 n=_n_;
 if missing(quantity) then
  do;
   do until (not missing(quantity));
     n=n-1;
     set l2(keep=quantity) point=n;  *second SET statement;
   end;
 end;
run;

For here's a comparison of the original data with the output for each of these approaches. The quantity_locf column is the result of @sarathannapareddy's LOCF approach. The quantity_next_rules column shows the output of @mkeintz's array approach that takes into account the OP's business rules.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

sfsdtegsdsdgdffhgfh · Posted 05-24-2010 09:22 AM

you can use retain function.

Howles · Posted 05-24-2010 11:51 AM

Take a look at http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back

sas_ · Posted 05-25-2010 03:21 AM

data l2;
input ID prescribeddate$ 3-14 drug$ quantity;
cards;
1 08/jan/2008 A 28
1 08/feb/2008 A 14
1 08/Mar/2008 A .
1 08/Apr/2008 B 56
1 08/May/2008 B .
2 08/Jan/2008 A 30
2 08/Mar/2009 B .
2 28/Mar/2008 A 30
2 08/Mar/2008 B 14
3 08/Jan/2008 B .
3 08/Mar/2009 B 56
3 08/Mar/2008 A .
4 08/jan/2008 A .
4 08/feb/2008 A .
4 08/Mar/2008 A 14
4 08/Apr/2008 B .
4 08/May/2008 B .
;
run;

data l3;
set l2;
retain tot;
if quantity>. then tot=quantity;
run;

omer2020 · Posted 10-05-2017 06:14 PM

I am amazed to see retain function can do that. I thought it is only used to keep variable (Columns) in first place in newly created data set. (when used after data step, before set). What I am trying to do is instead of retaining values after noon missing values, I want to replace missing values with the leading values;

what I have

Country ID Freq total;

CAD A 290 .

CAD B 48 .

CAD C 98 436

US A 3592 .

US B 157 .

US C 1018 4767

What I want;

tot

436

4767

can u suggest some thing for that please.

And thanks again for telling me something totally new

sravanece11 · Posted 08-18-2018 06:51 AM

Hi,

It worked.. but could you please explain why are we using ">." here and what is the use of it?

@sas_ wrote:
data l2;
input ID prescribeddate$ 3-14 drug$ quantity;
cards;
1 08/jan/2008 A 28
1 08/feb/2008 A 14
1 08/Mar/2008 A .
1 08/Apr/2008 B 56
1 08/May/2008 B .
2 08/Jan/2008 A 30
2 08/Mar/2009 B .
2 28/Mar/2008 A 30
2 08/Mar/2008 B 14
3 08/Jan/2008 B .
3 08/Mar/2009 B 56
3 08/Mar/2008 A .
4 08/jan/2008 A .
4 08/feb/2008 A .
4 08/Mar/2008 A 14
4 08/Apr/2008 B .
4 08/May/2008 B .
;
run;

data l3;
set l2;
retain tot;
if quantity>. then tot=quantity;
run;

PaigeMiller · Posted 08-18-2018 07:14 AM

Any actual numeric value is considered to be greater than the dot which stands for missing.

--
Paige Miller

mkeintz · Posted 08-18-2018 09:41 PM

The missing value (.) is treated by SAS as a less than all non-missing numeric values, whether positive or negative. In fact it is less than other missing values, such as .A, .B, ... through .Z, but not less than ._.

Most of the time people don't bother with special missing values. In such cases the expression

X > .

is effectively the same as

not(missing(x))

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sarathannapareddy · Posted 05-25-2010 11:23 AM

Using 2 SET Statements in a single datastep:

Here is the code:

data l4;
set l2;
n=_n_;
if missing(quantity) then do;
do until (not missing(quantity));
n=n-1;
set l2(keep=quantity) point=n; *second SET statement;
end;
end;
run;

Sarath Annapareddy
http://studysas.blogspot.com/2008/10/lag-function-how-to-obtain-information.html

MelissaM · Posted 11-09-2017 03:23 PM

@sarathannapareddy

This worked wonderfully, except I first had to sort in descending order (easy!).

Thank you!

Sandhya · Posted 05-26-2010 02:08 PM

data MasterRec;
do until(last.drug);
set Curr_data;
by drug quantity;
if last.quantity and not missing(quantity) then do;
q = quantity;
end;
if missing(quantity) then do;
quantity=q;
end;
output;
end;
run;

mkeintz · Posted 10-06-2017 08:14 PM

Editor's note: this is a popular topic. We've edited the solution to be more helpful for those facing a similar challenge. The first part here addresses the original question, and the second part addresses a more general LOCF (last observation carried forward) approach.

You apparently will go back one observation, but will go forward until a non-missing quantity is encountered, which was not how I understood your description. In other words, if the preceding observation is missing, then your task is a Next Observation Carried Back process.

I think the best way to approach this is to read an ID/DRUG group once and store quantity values into a array (_Q below). Then reread the group and when a missing quantity is encountered, read back from the array:

data want;

  array _q {0:10} _temporary_;
  do N=1 by 1 until (last.drug); 
    set have;
    by id drug notsorted;
    _q{N}=quantity;
  end;

  do I=1 to N;
    set have;
    do J= I-1 to N while (quantity=.);
      quantity=_q{J};
    end;
    if quantity=. then quantity=28;
    output;
  end;
  drop I J N;
run;

Notes that the lower bound of the _Q array should be set to 0, so that "do J=I-1 to ..." works when I=1.
Make sure the upper bound of array _Q will accommodate the longest series for single ID/DRUG group.

For the LOCF approach, @sarathannapareddy suggests the following.

Using 2 SET Statements in a single datastep:

Here is the code:

data l4;
 set l2;
 n=_n_;
 if missing(quantity) then
  do;
   do until (not missing(quantity));
     n=n-1;
     set l2(keep=quantity) point=n;  *second SET statement;
   end;
 end;
run;

For here's a comparison of the original data with the output for each of these approaches. The quantity_locf column is the result of @sarathannapareddy's LOCF approach. The quantity_next_rules column shows the output of @mkeintz's array approach that takes into account the OP's business rules.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away