BookmarkSubscribeRSS Feed
Cruise
Ammonite | Level 13

Dear SAS experts:

 

I'm trying to pick random 'day' when only 'day' field missing and 'day' and 'month' when both fields missing given all records have complete 'year'.

 

Variables:

Date of diagnosis (dx_month, dx_day and dx_year) where missing is denoted by 99.

Time sequence of diagnosis (seq_num) for multiple records of individuals

Censor (date of death or study cut off date if patient alive)

 

Assumptions/facts:

  1. All records have ‘year’ of date (no missing)
  2. ‘day’ field is unknown when ‘month’ field is unknown
  3. unknown ‘day’ field and known ‘month’ field combination exits   
  4. Sequence_number (time sequence of diagnosis) is correct and successive integers within individual records.
  5. Effect of leap year and February is negligible
  6. Variable ‘censor’ complete (either date of death or study cut off date)
  7. Range of a number of multiple records per person is 1-12. The mock data was output to 4 records per person.  

Selection rules:

If both month and day missing then pick random date between (‘01/01/known yyyy’, ‘12/31/known yyyy’) given:

  1. not earlier than the date of previous record when applies;
  2. not later than the date of subsequent record when applies and censoring’ date;
  3. when missing in multiple records then fill in the earlier record first.
  4. If missing in ‘month’ and ‘day’ in N consecutive records within a year then possible time window is: 01/01/known year and min(12/31/known year and censor date). Create N random dates in the order of time sequence (seq_num) fitting the time window at random or equal distance from each other?  
  5. If the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date

Use the same logic for the unknown ‘day’ and known ‘month’ and ‘year’ records.  

 

I have a not so elegant solution for one record per person data. But this doesn't help much for repeated row data.

 

%macro RandBetween(min, max); (&min + floor((1+&max-&min)*rand("uniform")))
%mend;
DATA HAVE1; SET HAVE;
RAND_DX_MONTH = %RandBetween(1,CENSOR_MONTH);
RAND_DX_DAY= %RandBetween(1,CENSOR_DAY);
RAND_DX=MDY(RAND_DX_MONTH, RAND_DX_DAY, DX_YEAR);

or
RAND_DX = %RandBetween(JAN/01/KNOWN_YEAR,CENSOR); run;
data have;
input id dx_year dx_month dx_day seq_num CENSOR	N_COUNT;
cards;
1	2009	3	10	2	19570	4
1	2010	9	7	3	19570	4
1	2010	9	7	4	19570	4
1	2012	99	99	5	19570	4
2	2004	4	6	1	20088	4
2	2008	99	99	2	20088	4
2	2010	3	23	3	20088	4
2	2012	4	2	4	20088	4
3	2005	1	24	1	19082	4
3	2005	2	22	2	19082	4
3	2005	99	99	3	19082	4
3	2012	99	99	4	19082	4
4	2005	99	99	1	19644	4
4	2005	99	99	2	19644	4
4	2006	99	99	3	19644	4
4	2012	11	13	4	19644	4
5	2011	8	1	1	19758	4
5	2012	99	99	2	19758	4
5	2013	99	99	3	19758	4
5	2014	1	13	4	19758	4
6	2009	3	25	1	18748	4
6	2011	99	99	2	18748	4
6	2011	99	99	3	18748	4
6	2011	99	99	4	18748	4
7	2005	2	99	1	20088	4
7	2006	7	99	2	20088	4
7	2010	5	4	3	20088	4
7	2010	8	3	4	20088	4
8	2005	7	12	2	17746	4
8	2005	8	99	1	17746	4
8	2006	11	1	3	17746	4
8	2008	7	16	4	17746	4
;

proc print data=have;
format censor date9.;
run;

 

7 REPLIES 7
ballardw
Super User

Your data does not show any "missing" values for month or day.

Do you mean that 99 should be treated as "missing" for those variables?

 

You should provide some very clear examples of applying your rules 1,2, and 3.

 

You have sequence number that jumps from 3 to 60 for ID=6. Does that imply that there are actually 56 missing values that need to be filled in?

 

 

I have a very hard time feeling comfortable with generating pseudo random "diagnosis dates".

Cruise
Ammonite | Level 13

@ballardw 

1. Missing values are denoted to 99. Sorry for not making that clear.
2. I'll work on clear examples;
3. Sequence numbers take 0-10 for class 1 and above 60 for class 2. Which means that person with ID=6 has two different classes of conditions. But class1 precedes class 2, thus, 3 comes before 60. I should have simplified this in the sample data. It doesn't imply that there are actually 56 missing values that need to be filled in.

4.Regarding random pick for diagnosis, random selection will be used as one of multiple scenarios in my simulation study. Random selection will be compared to three other imputation strategy results and the gold standard data. In other words, I'm not running a model on the data or making inferential decisions based on the data using data with a randomly selected diagnosis. 

ballardw
Super User

Something to consider for your examples:

 

If the first random date for an Id with multiple records within a year with missing day and month is 31 Dec it isn't possible to get the following random date to be later in the year.

The more missing day month combinations within a year the more likely you can run into this situation.

 

Does the "class 2" have any affect on what you are doing such as hypothetical data like:

 

6	2009	3	25	1	18748	4
6	2011	99	99	2	18748	4
6	2011	99	99	3	18748	4
6	2011	99	99	60	18748	4
6	2011	99	99	4	18748	4
6	2011	99	99	5	18748	4

This has some potential with maintaining order of processing as without a date then the sequence within year may depend on sequence for some processing.

 

I'm going to guess that if the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date. Correct?

 

If that can happen you will need to address a similar issue related to bounds. Example: SEQ=3 has month =1 day =13, for random assignment of SEQ=1 we get month=1 (pretty forced in this case) and day =12. What is SEQ=2 going to get??

 

More fun ensues if you  have multiple dates with missing scattered between them.

Cruise
Ammonite | Level 13

@ballardw thanks for pointing out the jump between 3 and 60. I realized that I supposed to subset my data to class1 and class2 and  resolve them separately. Therefore, I'm editing 60 in the sample data.Thanks for the pointer so that this issue came to my attention to subset the data by class. Really, cases are totally different animals by the 'classes'.

Cruise
Ammonite | Level 13

@ballardw 

 

How about:

 

  • If missing in ‘month’ and ‘day’ in N consecutive records within a year then possible time window is: 01/01/known year and min(12/31/known year and censor date). Create N random dates in the order of time sequence (seq_num) fitting within the time window at random or equal distance from each other?  
  • Correct-If the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date.
  • Close dates: when multiple missing in a same year restricted to known dates of subsequent records then possible time window is 01/01/known year of SEQ1 and min (date of subsequent record, censoring date). The range between SEQ1 (Jan 1 of the known year) with missing and SEQ3 with known date will be calculated first to allocate random dates picked for SEQ1 and SEQ2 placed within the time window at equal or random distance in between.

Can you elaborate on this please? “more fun ensues if you have multiple dates with missing scattered between them”.

ballardw
Super User

@Cruise wrote:

@ballardw 

 

  • SEQ1 and SEQ2 placed within the time window at equal or random distance in between.

Can you elaborate on this please? “more fun ensues if you have multiple dates with missing scattered between them”.


Your whole problem description has a class of issues sometimes referred to as "boundary values". Each possible boundary, i.e a defined date, between missing dates creates another set of "how to I squeeze potentially X values into Y slots" when Y provides fewer than X slots. Basically just the complexity involved with detecting and choosing the appropriate set rules. With different look ahead/ look behind intervals.

 

A semi-pathologic case possible in your data:

ID Year   Month Day  Seq

10  2015 8          8      4

10  2015  .          .       5

10  2015  .          .       6

10  2015  8        10     7

 

would request 2 dates between 8/8/2015 and 8/10/2015 when there is only one date possible.( or 5 days apart that need 6 or more insertions)

The part is that a clearly defined algorithm for addressing all of the possibly boundary conditions and then all of the rules to address each boundary condition set.

 

I don't have any slick ideas yet, just trying to get all of the rules set before coding anything.

Cruise
Ammonite | Level 13

@ballardw 

 

Thanks for getting back to me. I really appreciate it and your case example.

 

ID=10 has 4 sequential records in the same month 2 days apart from each other. Records can take same date of diagnosis. Both SEQ5 and SEQ6 can take August 8th thru August 10th as long as SEQ6 is not earlier than SEQ 5. Same day occurrence would not violate the biological plausibility due to its chronic nature.

 

The problem is solved here but they used Mid_points. I want to use exact same approach but random selection where they're choosing midpoints of the possible time windows.

 

https://seer.cancer.gov/survivaltime/CalculateSurvivalTimeInMonths.sas

 

Reeza solved the problem for one record per row context. I edited the code to the fact that all records have correct year of diagnosis.

https://communities.sas.com/t5/SAS-Programming/Pick-random-values-conditional-to-other-variables/m-p...

@Reeza 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 915 views
  • 1 like
  • 2 in conversation