Re: Select random values to replace missing in sequential multiple rec...

Cruise · Posted 06-21-2019 04:13 PM

Dear SAS experts:

I'm trying to pick random 'day' when only 'day' field missing and 'day' and 'month' when both fields missing given all records have complete 'year'.

Variables:

Date of diagnosis (dx_month, dx_day and dx_year) where missing is denoted by 99.

Time sequence of diagnosis (seq_num) for multiple records of individuals

Censor (date of death or study cut off date if patient alive)

Assumptions/facts:

All records have ‘year’ of date (no missing)
‘day’ field is unknown when ‘month’ field is unknown
unknown ‘day’ field and known ‘month’ field combination exits
Sequence_number (time sequence of diagnosis) is correct and successive integers within individual records.
Effect of leap year and February is negligible
Variable ‘censor’ complete (either date of death or study cut off date)
Range of a number of multiple records per person is 1-12. The mock data was output to 4 records per person.

Selection rules:

If both month and day missing then pick random date between (‘01/01/known yyyy’, ‘12/31/known yyyy’) given:

not earlier than the date of previous record when applies;
not later than the date of subsequent record when applies and censoring’ date;
when missing in multiple records then fill in the earlier record first.
If missing in ‘month’ and ‘day’ in N consecutive records within a year then possible time window is: 01/01/known year and min(12/31/known year and censor date). Create N random dates in the order of time sequence (seq_num) fitting the time window at random or equal distance from each other?
If the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date

Use the same logic for the unknown ‘day’ and known ‘month’ and ‘year’ records.

I have a not so elegant solution for one record per person data. But this doesn't help much for repeated row data.

%macro RandBetween(min, max); (&min + floor((1+&max-&min)*rand("uniform")))
%mend;
DATA HAVE1; SET HAVE;
RAND_DX_MONTH = %RandBetween(1,CENSOR_MONTH);
RAND_DX_DAY= %RandBetween(1,CENSOR_DAY);
RAND_DX=MDY(RAND_DX_MONTH, RAND_DX_DAY, DX_YEAR);

or 
RAND_DX = %RandBetween(JAN/01/KNOWN_YEAR,CENSOR); 
run;

data have;
input id dx_year dx_month dx_day seq_num CENSOR	N_COUNT;
cards;
1	2009	3	10	2	19570	4
1	2010	9	7	3	19570	4
1	2010	9	7	4	19570	4
1	2012	99	99	5	19570	4
2	2004	4	6	1	20088	4
2	2008	99	99	2	20088	4
2	2010	3	23	3	20088	4
2	2012	4	2	4	20088	4
3	2005	1	24	1	19082	4
3	2005	2	22	2	19082	4
3	2005	99	99	3	19082	4
3	2012	99	99	4	19082	4
4	2005	99	99	1	19644	4
4	2005	99	99	2	19644	4
4	2006	99	99	3	19644	4
4	2012	11	13	4	19644	4
5	2011	8	1	1	19758	4
5	2012	99	99	2	19758	4
5	2013	99	99	3	19758	4
5	2014	1	13	4	19758	4
6	2009	3	25	1	18748	4
6	2011	99	99	2	18748	4
6	2011	99	99	3	18748	4
6	2011	99	99	4	18748	4
7	2005	2	99	1	20088	4
7	2006	7	99	2	20088	4
7	2010	5	4	3	20088	4
7	2010	8	3	4	20088	4
8	2005	7	12	2	17746	4
8	2005	8	99	1	17746	4
8	2006	11	1	3	17746	4
8	2008	7	16	4	17746	4
;

proc print data=have;
format censor date9.;
run;

ballardw · Posted 06-21-2019 04:38 PM

Your data does not show any "missing" values for month or day.

Do you mean that 99 should be treated as "missing" for those variables?

You should provide some very clear examples of applying your rules 1,2, and 3.

You have sequence number that jumps from 3 to 60 for ID=6. Does that imply that there are actually 56 missing values that need to be filled in?

I have a very hard time feeling comfortable with generating pseudo random "diagnosis dates".

Cruise · Posted 06-21-2019 05:04 PM

@ballardw

1. Missing values are denoted to 99. Sorry for not making that clear.

2. I'll work on clear examples;

3. Sequence numbers take 0-10 for class 1 and above 60 for class 2. Which means that person with ID=6 has two different classes of conditions. But class1 precedes class 2, thus, 3 comes before 60. I should have simplified this in the sample data. It doesn't imply that there are actually 56 missing values that need to be filled in.

4.Regarding random pick for diagnosis, random selection will be used as one of multiple scenarios in my simulation study. Random selection will be compared to three other imputation strategy results and the gold standard data. In other words, I'm not running a model on the data or making inferential decisions based on the data using data with a randomly selected diagnosis.

ballardw · Posted 06-21-2019 05:29 PM

Something to consider for your examples:

If the first random date for an Id with multiple records within a year with missing day and month is 31 Dec it isn't possible to get the following random date to be later in the year.

The more missing day month combinations within a year the more likely you can run into this situation.

Does the "class 2" have any affect on what you are doing such as hypothetical data like:

6	2009	3	25	1	18748	4
6	2011	99	99	2	18748	4
6	2011	99	99	3	18748	4
6	2011	99	99	60	18748	4
6	2011	99	99	4	18748	4
6	2011	99	99	5	18748	4

This has some potential with maintaining order of processing as without a date then the sequence within year may depend on sequence for some processing.

I'm going to guess that if the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date. Correct?

If that can happen you will need to address a similar issue related to bounds. Example: SEQ=3 has month =1 day =13, for random assignment of SEQ=1 we get month=1 (pretty forced in this case) and day =12. What is SEQ=2 going to get??

More fun ensues if you have multiple dates with missing scattered between them.

Cruise · Posted 06-21-2019 05:30 PM

@ballardw thanks for pointing out the jump between 3 and 60. I realized that I supposed to subset my data to class1 and class2 and resolve them separately. Therefore, I'm editing 60 in the sample data.Thanks for the pointer so that this issue came to my attention to subset the data by class. Really, cases are totally different animals by the 'classes'.

Cruise · Posted 06-21-2019 06:22 PM

@ballardw

How about:

If missing in ‘month’ and ‘day’ in N consecutive records within a year then possible time window is: 01/01/known year and min(12/31/known year and censor date). Create N random dates in the order of time sequence (seq_num) fitting within the time window at random or equal distance from each other?
Correct-If the first sequences within a year are missing month and day but a higher sequence has month and day that the previous dates have to be generated as before the one with an actual date.
Close dates: when multiple missing in a same year restricted to known dates of subsequent records then possible time window is 01/01/known year of SEQ1 and min (date of subsequent record, censoring date). The range between SEQ1 (Jan 1 of the known year) with missing and SEQ3 with known date will be calculated first to allocate random dates picked for SEQ1 and SEQ2 placed within the time window at equal or random distance in between.

Can you elaborate on this please? “more fun ensues if you have multiple dates with missing scattered between them”.

ballardw · Posted 06-24-2019 11:22 AM

@Cruise wrote:

@ballardw

SEQ1 and SEQ2 placed within the time window at equal or random distance in between.

Can you elaborate on this please? “more fun ensues if you have multiple dates with missing scattered between them”.

Your whole problem description has a class of issues sometimes referred to as "boundary values". Each possible boundary, i.e a defined date, between missing dates creates another set of "how to I squeeze potentially X values into Y slots" when Y provides fewer than X slots. Basically just the complexity involved with detecting and choosing the appropriate set rules. With different look ahead/ look behind intervals.

A semi-pathologic case possible in your data:

ID Year Month Day Seq

10 2015 8 8 4

10 2015 . . 5

10 2015 . . 6

10 2015 8 10 7

would request 2 dates between 8/8/2015 and 8/10/2015 when there is only one date possible.( or 5 days apart that need 6 or more insertions)

The part is that a clearly defined algorithm for addressing all of the possibly boundary conditions and then all of the rules to address each boundary condition set.

I don't have any slick ideas yet, just trying to get all of the rules set before coding anything.

Cruise · Posted 06-24-2019 01:08 PM

@ballardw

Thanks for getting back to me. I really appreciate it and your case example.

ID=10 has 4 sequential records in the same month 2 days apart from each other. Records can take same date of diagnosis. Both SEQ5 and SEQ6 can take August 8th thru August 10th as long as SEQ6 is not earlier than SEQ 5. Same day occurrence would not violate the biological plausibility due to its chronic nature.

The problem is solved here but they used Mid_points. I want to use exact same approach but random selection where they're choosing midpoints of the possible time windows.

https://seer.cancer.gov/survivaltime/CalculateSurvivalTimeInMonths.sas

Reeza solved the problem for one record per row context. I edited the code to the fact that all records have correct year of diagnosis.

https://communities.sas.com/t5/SAS-Programming/Pick-random-values-conditional-to-other-variables/m-p...

@Reeza

Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

Re: Select random values to replace missing in sequential multiple records per person data

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away