PatientID | Year | DiagA | DiagB | DiagC |
1 | 2010 | . | . | . |
1 | 2011 | . | 1 | . |
1 | 2013 | . | . | . |
1 | 2014 | 1 | . | . |
PatientID | Year | DiagA | DiagB | DiagC |
1 | 2010 | . | 1 | . |
1 | 2011 | . | 1 | . |
1 | 2013 | . | 1 | . |
1 | 2014 | 1 | . | . |
So you only want to expand the non-missing value into the previous and following year, but not any farther?
Sorry, I am struggling to see the logic behind this. Your data basically resolves to:
patientid year diag result
1 2011 B 1
1 2014 A 1
So what is the logical answer to these questions:
Where does the 2010 come from?
Why should diagb be imputed to be the previous year and next year, why not diaga, that could just as easily populate 2013?
To my mind you would be better off getting your data - i.e. actual data, not missing data, into a normalised structure and then populating out imputation rules from there
To add, this is why CDISC strucutres are normalised. Once you add in reference dates, it is very easy to apply imputation rules on cascading data, i.e. refstart to row date = needs to be populated, then row date to refend needs to be populated.
, something like:
data have; input PatientID Year DiagA DiagB DiagC; datalines; 1 2010 . . . 1 2011 . 1 . 1 2013 . . . 1 2014 1 . . ; run; data want (drop=diaga diagb diagc); set have; if diaga ne . then do; diag="A"; result=diaga; output; end; if diagb ne . then do; diag="B"; result=diagb; output; end; if diagc ne . then do; diag="C"; result=diagc; output; end; run; /* Note this just expands the data based on the difference in years + 1, the logic will change here based on your imputation rules */ data want; set want; retain start; if _n_=1 then start=2010; do i=start to year +1; year=i; output; end; start=year; run;
How about this:
First Step: use the dataset as ordered. retain non-missing values from a certain record for as long as missing values are encountered, and also set an incrementing counter (which will tell you later how "far off" you are). Put values and counters into new variables.
Second Step: reverse the order of the dataset, and repeat what you did in step one.
Third step: put dataset back into original order. Then go through it, and replace missing values with either the "up" or "down" retained values, according to which method has the lower count. Drop the helper variables.
If that sounds right from the logic, I would supply example code.
Does this get you close?
data patients ;
infile datalines dsd delimiter=',';
input patientID $ year $ diagA $ diagB $ diagC $ ;
datalines;
1,2010,,,.
1,2011,,1,1
1,2013,,,
1,2014,1,,
;
run;
title1 'patients' ;
proc print data = patients ;
run;
/*
* get extreme values for each patient
*/
proc sql ;
create table patients_defaults as
select patientID
, prxchange('s/(\d+),\d+/$1/' , -1 , max(case when diagA is not null then CATX(',',year,diagA) else ' ' end) ) AS diagA_default_year
, prxchange('s/\d+,(\d+)/$1/' , -1 , max(case when diagA is not null then CATX(',',year,diagA) else ' ' end) ) AS diagA_default
, prxchange('s/(\d+),\d+/$1/' , -1 , max(case when diagB is not null then CATX(',',year,diagB) else ' ' end) ) AS diagB_default_year
, prxchange('s/\d+,(\d+)/$1/' , -1 , max(case when diagB is not null then CATX(',',year,diagB) else ' ' end) ) AS diagB_default
, prxchange('s/(\d+),\d+/$1/' , -1 , max(case when diagC is not null then CATX(',',year,diagC) else ' ' end) ) AS diagC_default_year
, prxchange('s/\d+,(\d+)/$1/', -1 , max(case when diagC is not null then CATX(',',year,diagC) else ' ' end) ) AS diagC_default
from patients
group by patientID ;
quit;
run;
title1 'patient defaults' ;
proc print data = patients_defaults ;
run;
/*
* apply extreme values to each patient detail
*/
title1 'patients with defaults ' ;
proc sql ;
select p.patientID
, p.year
, case when p.diagA is null then d.diagA_default else p.diagA end AS diagA
, case when p.diagA is null then d.diagA_default_year else p.year end AS diagA_year
, case when p.diagb is null then d.diagb_default else p.diagB end AS diagB
, case when p.diagB is null then d.diagB_default_year else p.year end AS diagB_year
, case when p.diagb is null then d.diagc_default else p.diagC end AS diagC
, case when p.diagC is null then d.diagC_default_year else p.year end AS diagC_year
from patients p
join patients_defaults d
on p.patientID = d.patientID
order by 1,2
;
quit;
run;
Hi,
Is there any easier way of doing this if I restrict myself to looking back in time? So, in my example, the first observation would continue to have missing values, while the third observation would assume the values from the second observation.
No matter what else, when you wish to find extreme values and apply them back to the original data, you have to match the data to a version of itself. I think that applying global extremes (eg MAX, MIN, SUM, etc.), is easier than local extremes (eg, MAX within 1, 2, or 3 records distance).
I suppose you could mimic this by dancing back and forth through a dataset, using point, or else doing a one-to-one merge of the data with a copy of itself offest by one record. But you may not totally eliminate missing values that way.
'Easier' might be matter of taste.
There are a lot scenario you need to consider about .
data have;
infile cards expandtabs truncover;
input PatientID Year DiagA DiagB DiagC;
datalines;
1 2010 . . .
1 2011 . 1 .
1 2013 . . .
1 2014 1 . .
2 2010 1 . .
2 2011 . 1 .
2 2013 . . .
2 2014 1 . .
;
run;
data temp;
set have;
by PatientID ;
retain found;
if first.PatientID then found=.;
if nmiss(DiagA, DiagB ,DiagC) ne 3 and not found then do;output;found=1;end;
keep PatientID DiagA DiagB DiagC;
rename DiagA=A DiagB=B DiagC=C;
run;
data want;
merge have temp;
by PatientID ;
retain found _DiagA _DiagB _DiagC;
if first.PatientID then call missing(found,_DiagA, _DiagB ,_DiagC);
_DiagA=coalesce(DiagA,_DiagA);
_DiagB=coalesce(DiagB,_DiagB);
_DiagC=coalesce(DiagC,_DiagC);
if nmiss(DiagA, DiagB ,DiagC) eq 3 then do;
DiagA=_DiagA;DiagB=_DiagB;DiagC=_DiagC;
if not found then do;
DiagA=A; DiagB=B; DiagC=C;
end;
end;
else found=1;
drop A B C _: found;
run;
Hello again,
I changed my mind. What I want to do now should be more simple. For observations where diagnosis is missing, I want SAS to insert the most recent non-blank diagnosis value for that particular patient, as outlined in the JPEG image I attached at the bottom of this post.
I tried the following code:
data want;
set have;
retain _diagnosis;
if not missing(diagnosis) then _diagnosis=diagnosis;
else diagnosis=_diagnosis;
drop _diagnosis;
run;
The problem with my code is that it doesn't take into account the patient. So, for the table below, SAS would insert A into the first observation of patient 3, although that diagnosis belongs to patient 2. I need your help in making this code conditional on patient.
Yes, thats a lot easier, see normalised data structures are easier to work with 🐵
data have; infile datalines missover; input patientid diagnosis $; datalines; 1 A 1 1 2 A 2 A 2 3 3 B 3 ; run; data want (drop=diag); set have; retain diag; if lag(patientid)=patientid and diagnosis="" then diagnosis=diag; diag=diagnosis; run;
Do note it is a lot easier for us if you put test data in the form of a datastep in the post rather than attachments, pictures etc. which we then have to type in by hand.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.