Dear SAS experts:
I'm trying to pick random 'day' when only 'day' field missing and 'day' and 'month' when both fields missing given all records have complete 'year'.
Variables:
Date of diagnosis (dx_month, dx_day and dx_year) where missing is denoted by 99.
Time sequence of diagnosis (seq_num) for multiple records of individuals
Censor (date of death or study cut off date if patient alive)
Assumptions/facts:
Selection rules:
If both month and day missing then pick random date between (‘01/01/known yyyy’, ‘12/31/known yyyy’) given:
Use the same logic for the unknown ‘day’ and known ‘month’ and ‘year’ records.
I have a not so elegant solution for one record per person data. But this doesn't help much for repeated row data.
%macro RandBetween(min, max); (&min + floor((1+&max-&min)*rand("uniform")))
%mend;
DATA HAVE1; SET HAVE;
RAND_DX_MONTH = %RandBetween(1,CENSOR_MONTH);
RAND_DX_DAY= %RandBetween(1,CENSOR_DAY);
RAND_DX=MDY(RAND_DX_MONTH, RAND_DX_DAY, DX_YEAR);
or
RAND_DX = %RandBetween(JAN/01/KNOWN_YEAR,CENSOR);
run;
data have;
input id dx_year dx_month dx_day seq_num CENSOR N_COUNT;
cards;
1 2009 3 10 2 19570 4
1 2010 9 7 3 19570 4
1 2010 9 7 4 19570 4
1 2012 99 99 5 19570 4
2 2004 4 6 1 20088 4
2 2008 99 99 2 20088 4
2 2010 3 23 3 20088 4
2 2012 4 2 4 20088 4
3 2005 1 24 1 19082 4
3 2005 2 22 2 19082 4
3 2005 99 99 3 19082 4
3 2012 99 99 4 19082 4
4 2005 99 99 1 19644 4
4 2005 99 99 2 19644 4
4 2006 99 99 3 19644 4
4 2012 11 13 4 19644 4
5 2011 8 1 1 19758 4
5 2012 99 99 2 19758 4
5 2013 99 99 3 19758 4
5 2014 1 13 4 19758 4
6 2009 3 25 1 18748 4
6 2011 99 99 2 18748 4
6 2011 99 99 3 18748 4
6 2011 99 99 4 18748 4
7 2005 2 99 1 20088 4
7 2006 7 99 2 20088 4
7 2010 5 4 3 20088 4
7 2010 8 3 4 20088 4
8 2005 7 12 2 17746 4
8 2005 8 99 1 17746 4
8 2006 11 1 3 17746 4
8 2008 7 16 4 17746 4
;
proc print data=have;
format censor date9.;
run;
Your data does not show any "missing" values for month or day.
Do you mean that 99 should be treated as "missing" for those variables?
You should provide some very clear examples of applying your rules 1,2, and 3.
You have sequence number that jumps from 3 to 60 for ID=6. Does that imply that there are actually 56 missing values that need to be filled in?
I have a very hard time feeling comfortable with generating pseudo random "diagnosis dates".
4.Regarding random pick for diagnosis, random selection will be used as one of multiple scenarios in my simulation study. Random selection will be compared to three other imputation strategy results and the gold standard data. In other words, I'm not running a model on the data or making inferential decisions based on the data using data with a randomly selected diagnosis.
Something to consider for your examples:
If the first random date for an Id with multiple records within a year with missing day and month is 31 Dec it isn't possible to get the following random date to be later in the year.
The more missing day month combinations within a year the more likely you can run into this situation.
Does the "class 2" have any affect on what you are doing such as hypothetical data like:
6 2009 3 25 1 18748 4 6 2011 99 99 2 18748 4 6 2011 99 99 3 18748 4 6 2011 99 99 60 18748 4 6 2011 99 99 4 18748 4 6 2011 99 99 5 18748 4
This has some potential with maintaining order of processing as without a date then the sequence within year may depend on sequence for some processing.
I'm going to guess that if the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date. Correct?
If that can happen you will need to address a similar issue related to bounds. Example: SEQ=3 has month =1 day =13, for random assignment of SEQ=1 we get month=1 (pretty forced in this case) and day =12. What is SEQ=2 going to get??
More fun ensues if you have multiple dates with missing scattered between them.
@ballardw thanks for pointing out the jump between 3 and 60. I realized that I supposed to subset my data to class1 and class2 and resolve them separately. Therefore, I'm editing 60 in the sample data.Thanks for the pointer so that this issue came to my attention to subset the data by class. Really, cases are totally different animals by the 'classes'.
How about:
Can you elaborate on this please? “more fun ensues if you have multiple dates with missing scattered between them”.
@Cruise wrote:
- SEQ1 and SEQ2 placed within the time window at equal or random distance in between.
Can you elaborate on this please? “more fun ensues if you have multiple dates with missing scattered between them”.
Your whole problem description has a class of issues sometimes referred to as "boundary values". Each possible boundary, i.e a defined date, between missing dates creates another set of "how to I squeeze potentially X values into Y slots" when Y provides fewer than X slots. Basically just the complexity involved with detecting and choosing the appropriate set rules. With different look ahead/ look behind intervals.
A semi-pathologic case possible in your data:
ID Year Month Day Seq
10 2015 8 8 4
10 2015 . . 5
10 2015 . . 6
10 2015 8 10 7
would request 2 dates between 8/8/2015 and 8/10/2015 when there is only one date possible.( or 5 days apart that need 6 or more insertions)
The part is that a clearly defined algorithm for addressing all of the possibly boundary conditions and then all of the rules to address each boundary condition set.
I don't have any slick ideas yet, just trying to get all of the rules set before coding anything.
Thanks for getting back to me. I really appreciate it and your case example.
ID=10 has 4 sequential records in the same month 2 days apart from each other. Records can take same date of diagnosis. Both SEQ5 and SEQ6 can take August 8th thru August 10th as long as SEQ6 is not earlier than SEQ 5. Same day occurrence would not violate the biological plausibility due to its chronic nature.
The problem is solved here but they used Mid_points. I want to use exact same approach but random selection where they're choosing midpoints of the possible time windows.
https://seer.cancer.gov/survivaltime/CalculateSurvivalTimeInMonths.sas
Reeza solved the problem for one record per row context. I edited the code to the fact that all records have correct year of diagnosis.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.