Help using Base SAS procedures

Fill missing values in a row with values from the nearest observation with non-blank values

Reply
Contributor
Posts: 28

Fill missing values in a row with values from the nearest observation with non-blank values

Hi!

I have a dataset as follows:

PatientIDYearDiagADiagBDiagC
12010...
12011.1.
12013...
120141..

If, for an observation of a given patient, I have missing values for all cells in DiagA-DiagC, I want SAS to insert the values from the previous observation (of the same patient) with non-missing values in DiagA-DiagC. As shown in the example above, sometimes the first observation of a patient has missing values. In this case, I want SAS to insert the values from the nearest observation (of the same patient) that have non-missing values. The table should look like this:

PatientIDYearDiagADiagBDiagC
12010.1.
12011.1.
12013.1.
120141..
Super User
Posts: 6,936

Re: Fill missing values in a row with values from the nearest observation with non-blank values

So you only want to expand the non-missing value into the previous and following year, but not any farther?

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,401

Re: Fill missing values in a row with values from the nearest observation with non-blank values

[ Edited ]

Sorry, I am struggling to see the logic behind this.  Your data basically resolves to:

patientid year  diag  result

1            2011 B      1

1            2014 A      1

 

So what is the logical answer to these questions:

Where does the 2010 come from?

Why should diagb be imputed to be the previous year and next year, why not diaga, that could just as easily populate 2013?

To my mind you would be better off getting your data - i.e. actual data, not missing data, into a normalised structure and then populating out imputation rules from there

 

To add, this is why CDISC strucutres are normalised.  Once you add in reference dates, it is very easy to apply imputation rules on cascading data, i.e. refstart to row date = needs to be populated, then row date to refend needs to be populated.

 

, something like:

data have;
  input PatientID	Year DiagA DiagB DiagC;
datalines;
1	2010	.	.	.
1	2011	.	1	.
1	2013	.	.	.
1	2014	1	.	.
;
run;

data want (drop=diaga diagb diagc);
  set have;
  if diaga ne . then do; diag="A"; result=diaga; output; end;
  if diagb ne . then do; diag="B"; result=diagb; output; end;
  if diagc ne . then do; diag="C"; result=diagc; output; end;
run;

/* Note this just expands the data based on the difference in years + 1, the logic will change here based on your imputation rules */
data want;
  set want;
  retain start;
  if _n_=1 then start=2010;
  do i=start to year +1;
    year=i;
    output;
  end;
  start=year;
run;

 

Contributor
Posts: 28

Re: Fill missing values in a row with values from the nearest observation with non-blank values

Hello,

Some observations of some patients have missing values in DiagA, DiagB and DiagC. In those cases, I want to insert the values of the previous observation of the same patient where the diagnosis values are available, because those values are a proxy for current diagnosis.

For some patients, the first few observations of DiagA, DiagB and DiagC have missing values. In those cases I want to insert the values from the nearest observation where these values are available, which then would be an observation from a later point in time.
Super User
Posts: 6,936

Re: Fill missing values in a row with values from the nearest observation with non-blank values

How about this:

 

First Step: use the dataset as ordered. retain non-missing values from a certain record for as long as missing values are encountered, and also set an incrementing counter (which will tell you later how "far off" you are). Put values and counters into new variables.

Second Step: reverse the order of the dataset, and repeat what you did in step one.

Third step: put dataset back into original order. Then go through it, and replace missing values with either the "up" or "down" retained values, according to which method has the lower count. Drop the helper variables.

 

If that sounds right from the logic, I would supply example code.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 28

Re: Fill missing values in a row with values from the nearest observation with non-blank values

I realize it looks bad inserting B into 2013. However, I do have an additional variable that counts the number of days since the last observation. In that case I could just choose the values from whichever observation is closest in time.
Occasional Contributor
Posts: 7

Re: Fill missing values in a row with values from the nearest observation with non-blank values

Does this get you close?

 

 

 

data patients ;
infile datalines dsd delimiter=','; 
input patientID $ year $ diagA $ diagB $ diagC $ ;
datalines;
1,2010,,,.
1,2011,,1,1
1,2013,,,
1,2014,1,, 
;
run;
title1 'patients' ;
proc print data = patients ;
run;
/*
* get extreme values for each patient
*/
proc sql ;
create table patients_defaults as
select patientID
, prxchange('s/(\d+),\d+/$1/' , -1 , max(case when diagA is not null then CATX(',',year,diagA) else ' ' end) ) AS diagA_default_year
, prxchange('s/\d+,(\d+)/$1/' , -1 , max(case when diagA is not null then CATX(',',year,diagA) else ' ' end) ) AS diagA_default
, prxchange('s/(\d+),\d+/$1/' , -1 , max(case when diagB is not null then CATX(',',year,diagB) else ' ' end) ) AS diagB_default_year
, prxchange('s/\d+,(\d+)/$1/' , -1 , max(case when diagB is not null then CATX(',',year,diagB) else ' ' end) ) AS diagB_default
, prxchange('s/(\d+),\d+/$1/' , -1 , max(case when diagC is not null then CATX(',',year,diagC) else ' ' end) ) AS diagC_default_year
, prxchange('s/\d+,(\d+)/$1/', -1 , max(case when diagC is not null then CATX(',',year,diagC) else ' ' end) ) AS diagC_default
from patients
group by patientID ;
quit;
run;
title1 'patient defaults' ;
proc print data = patients_defaults ;
run;
/*
* apply extreme values to each patient detail 
*/
title1 'patients with defaults ' ; 
proc sql ;
select p.patientID
, p.year
, case when p.diagA is null then d.diagA_default else p.diagA end AS diagA
, case when p.diagA is null then d.diagA_default_year else p.year end AS diagA_year
, case when p.diagb is null then d.diagb_default else p.diagB end AS diagB
, case when p.diagB is null then d.diagB_default_year else p.year end AS diagB_year
, case when p.diagb is null then d.diagc_default else p.diagC end AS diagC
, case when p.diagC is null then d.diagC_default_year else p.year end AS diagC_year
from patients p
join patients_defaults d
on p.patientID = d.patientID
order by 1,2
;
quit;
run;
Contributor
Posts: 28

Re: Fill missing values in a row with values from the nearest observation with non-blank values

Hi,

 

Is there any easier way of doing this if I restrict myself to looking back in time? So, in my example, the first observation would continue to have missing values, while the third observation would assume the values from the second observation.

Occasional Contributor
Posts: 7

Re: Fill missing values in a row with values from the nearest observation with non-blank values

 

No matter what else, when you wish to find extreme values and apply them back to the original data, you have to match the data to a version of itself.  I think that applying global extremes (eg MAX, MIN, SUM, etc.), is easier than local extremes (eg, MAX within 1, 2, or 3 records distance).  

 

I suppose you could mimic this by dancing back and forth through a dataset, using point, or else doing a one-to-one merge of the data with  a copy of itself offest by one record.  But you may not totally eliminate missing values that way.

 

'Easier' might be matter of taste.

Super User
Posts: 9,676

Re: Fill missing values in a row with values from the nearest observation with non-blank values

There are a lot scenario you need to consider about .

 

 

data have;
infile cards expandtabs truncover;
  input PatientID	Year DiagA DiagB DiagC;
datalines;
1	2010	.	.	.
1	2011	.	1	.
1	2013	.	.	.
1	2014	1	.	.
2	2010	1	.	.
2	2011	.	1	.
2	2013	.	.	.
2	2014	1	.	.
;
run;

data temp;
 set have;
 by PatientID	;
 retain found;
 if first.PatientID then found=.;
 if nmiss(DiagA, DiagB ,DiagC) ne 3 and not found then do;output;found=1;end;
keep PatientID	DiagA DiagB DiagC;
rename DiagA=A DiagB=B DiagC=C;
run;
data want;
 merge have temp;
 by PatientID	;
 retain found _DiagA _DiagB _DiagC;
 if first.PatientID then call missing(found,_DiagA, _DiagB ,_DiagC);
 _DiagA=coalesce(DiagA,_DiagA);
 _DiagB=coalesce(DiagB,_DiagB);
 _DiagC=coalesce(DiagC,_DiagC);
 if nmiss(DiagA, DiagB ,DiagC) eq 3 then do;
  DiagA=_DiagA;DiagB=_DiagB;DiagC=_DiagC;
  if not found then do;
    DiagA=A; DiagB=B; DiagC=C;
  end;
 end;
 else found=1;
 drop A B C _: found;
run;
Contributor
Posts: 28

Re: Fill missing values in a row with values from the nearest observation with non-blank values

Hello again,

 

I changed my mind. What I want to do now should be more simple. For observations where diagnosis is missing, I want SAS to insert the most recent non-blank diagnosis value for that particular patient, as outlined in the JPEG image I attached at the bottom of this post.

 

I tried the following code:

 

data want;
set have;
retain _diagnosis;
if not missing(diagnosis) then _diagnosis=diagnosis;
else diagnosis=_diagnosis;
drop _diagnosis;
run;

 

 

The problem with my code is that it doesn't take into account the patient. So, for the table below, SAS would insert A into the first observation of patient 3, although that diagnosis belongs to patient 2. I need your help in making this code conditional on patient.

 

 

example.jpg

Super User
Super User
Posts: 7,401

Re: Fill missing values in a row with values from the nearest observation with non-blank values

Yes, thats a lot easier, see normalised data structures are easier to work with Smiley Surprised)

data have;
  infile datalines missover;
  input patientid diagnosis $;
datalines;
1 A
1
1
2 A
2 A
2 
3
3 B
3
;
run;

data want (drop=diag);
  set have;
  retain diag;
  if lag(patientid)=patientid and diagnosis="" then diagnosis=diag;
  diag=diagnosis;
run;

Do note it is a lot easier for us if you put test data in the form of a datastep in the post rather than attachments, pictures etc. which we then have to type in by hand.

Ask a Question
Discussion stats
  • 11 replies
  • 1653 views
  • 3 likes
  • 5 in conversation