Hi Folks:
I'm trying to create random dates to replace incomplete dates with missing month and/or day. Year fields are complete. I used %Rand macro from Rick's tutorial https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html. Conditions for the random selection include dates be not after censoring date and must be within the known year of diagnosis.
Could you please look at my SAS code below on the mock data where I created RAND_DX. Any critiques or suggestions or improvements appreciated to achieve my programming objective I stated above.
Thank you for taking your time in this.
My variables:
Date (dx_month, dx_day and dx_year)
Censor (censor_month, censor_day, censor_year)
Assumptions/facts:
DATA HAVE;
INPUT ID dx_year dx_month dx_day CENSOR CENSOR_DAY CENSOR_MONTH CENSOR_YEAR;
CARDS;
1 2005 99 99 20088 31 12 2014
2 2005 99 99 17308 22 5 2007
3 2005 99 99 16521 26 3 2005
4 2005 99 99 17074 30 9 2006
5 2005 99 99 16620 3 7 2005
6 2005 99 99 16709 30 9 2005
7 2005 99 99 16901 10 4 2006
8 2005 99 99 16777 7 12 2005
9 2005 99 99 16763 23 11 2005
10 2005 99 99 17750 6 8 2008
11 2005 99 99 16621 4 7 2005
12 2005 99 99 20088 31 12 2014
13 2005 99 99 16636 19 7 2005
14 2005 99 99 16685 6 9 2005
15 2005 99 99 16778 8 12 2005
16 2005 99 99 17129 24 11 2006
17 2005 99 99 20088 31 12 2014
18 2005 99 99 17057 13 9 2006
19 2005 99 99 16486 19 2 2005
20 2005 99 99 16548 22 4 2005
;
%macro RandBetween(min, max);
(&min + floor((1+&max-&min)*rand("uniform")))
%mend;
DATA HAVE1; SET HAVE;
RAND_DX_MONTH = %RandBetween(1,CENSOR_MONTH);
RAND_DX_DAY= %RandBetween(1,CENSOR_DAY);
RAND_DX=MDY(RAND_DX_MONTH, RAND_DX_DAY, DX_YEAR);
run;
SAS has the rand('integer', ....) function now so you don't need that macro unless you're on an older version of SAS.
Dates are integers. So you can calculate the earliest possible date for the 'event' and latest possible date and select a random value in-between them.
Your current methodology is more restrictive than this approach, for example, you can't have the event be within the same month as the censoring, which is possible. You may have a longer gap required based on what exactly you're measuring - diagnosis to censoring may be always at least 60 days or something like that.
For the day portion of the date, what happens if the day of censoring is the first of the month? From your current method that would leave a single day that can be chosen. It does not need to be less than the day, this is a parameter that needs some relationship to the month as well otherwise it's adding an additional restriction that may influence your data but not be accurate at all.
I haven't bothered to set any seeds, I'll leave that to you. I added in a check below to ensure that the event date is always before the censor date.
DATA HAVE;
INPUT ID dx_year dx_month dx_day CENSOR CENSOR_DAY CENSOR_MONTH CENSOR_YEAR;
CARDS;
1 2005 12 99 20088 31 12 2014
2 2005 1 99 17308 22 5 2007
3 2005 4 99 16521 26 3 2005
4 2005 99 99 17074 30 9 2006
5 2005 99 99 16620 3 7 2005
6 2005 99 99 16709 30 9 2005
7 2005 99 99 16901 10 4 2006
8 2005 99 99 16777 7 12 2005
9 2005 99 99 16763 23 11 2005
10 2005 99 99 17750 6 8 2008
11 2005 99 99 16621 4 7 2005
12 2005 99 99 20088 31 12 2014
13 2005 99 99 16636 19 7 2005
14 2005 99 99 16685 6 9 2005
15 2005 99 99 16778 8 12 2005
16 2005 99 99 17129 24 11 2006
17 2005 99 99 20088 31 12 2014
18 2005 99 99 17057 13 9 2006
19 2005 99 99 16486 19 2 2005
20 2005 99 99 16548 22 4 2005
;
data want;
set have;
*calculate censor date;
censor_date=mdy(censor_month, censor_day, censor_year);
*handle missing values for both day and month;
if dx_month = 99 and dx_day = 99 then
do;
*first possible date is the beginning of the year;
first_possible_date=intnx('year', censor_date, 0, 'b');
*last possible date is the day before censoring;
last_possible_date=censor_date - 1;
*event date is a random date between these dates;
event_date=rand('integer', first_possible_date, last_possible_date);
*assign day and months as needed;
dx_month=month(event_date);
dx_day=day(event_date);
end;
else if dx_day eq 99 and dx_month ne 99 then
do;
*first possible date is beginning of month indicated;
first_possible_date=mdy(censor_month, 1, censor_year);
*last possible date is last of month;
*EXCEPT if month is same as censor month;
*in that case, use the earlier of the two dates;
last_possible_date=min(intnx('month', first_possible_date, 0, 'e'),
censor_date - 1);
*calculate a random event date;
event_date=rand('integer', first_possible_date, last_possible_date);
*assign day as needed;
dx_day=day(event_date);
end;
else
event_date=mdy(dx_month, dx_day, dx_year);
*format dates for display and legibility;
format censor event_date first_possible_date last_possible_date date9.;
check = censor_date - event_date;
run;
PS.....if this is going to be published I'm going to want my name on the paper 😉
None of the records in your sample data set meet this sitatuion/assumption:
@Cruise wrote:
Hi Folks:
I'm trying to create random dates to replace incomplete dates with missing month and/or day. Year fields are complete. I used %Rand macro from Rick's tutorial https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html. Conditions for the random selection include dates be not after censoring date and must be within the known year.
Could you please look at my SAS code below on the mock data where I created RAND_DX. Any critiques or suggestions or improvements appreciated to achieve my programming objective I stated above.
Thank you for taking your time in this.
My variables:
Date (dx_month, dx_day and dx_year)
Censor (censor_month, censor_day, censor_year)
Assumptions/facts:
- All records have ‘year’ of date (no missing)
- ‘day’ field is unknown when ‘month’ field is unknown
- ‘day’ field can be unknown when ‘month’ field is known
- Effect of leap year and February is negligible
- Variable ‘censor’ is complete
SAS has the rand('integer', ....) function now so you don't need that macro unless you're on an older version of SAS.
Dates are integers. So you can calculate the earliest possible date for the 'event' and latest possible date and select a random value in-between them.
Your current methodology is more restrictive than this approach, for example, you can't have the event be within the same month as the censoring, which is possible. You may have a longer gap required based on what exactly you're measuring - diagnosis to censoring may be always at least 60 days or something like that.
For the day portion of the date, what happens if the day of censoring is the first of the month? From your current method that would leave a single day that can be chosen. It does not need to be less than the day, this is a parameter that needs some relationship to the month as well otherwise it's adding an additional restriction that may influence your data but not be accurate at all.
I haven't bothered to set any seeds, I'll leave that to you. I added in a check below to ensure that the event date is always before the censor date.
DATA HAVE;
INPUT ID dx_year dx_month dx_day CENSOR CENSOR_DAY CENSOR_MONTH CENSOR_YEAR;
CARDS;
1 2005 12 99 20088 31 12 2014
2 2005 1 99 17308 22 5 2007
3 2005 4 99 16521 26 3 2005
4 2005 99 99 17074 30 9 2006
5 2005 99 99 16620 3 7 2005
6 2005 99 99 16709 30 9 2005
7 2005 99 99 16901 10 4 2006
8 2005 99 99 16777 7 12 2005
9 2005 99 99 16763 23 11 2005
10 2005 99 99 17750 6 8 2008
11 2005 99 99 16621 4 7 2005
12 2005 99 99 20088 31 12 2014
13 2005 99 99 16636 19 7 2005
14 2005 99 99 16685 6 9 2005
15 2005 99 99 16778 8 12 2005
16 2005 99 99 17129 24 11 2006
17 2005 99 99 20088 31 12 2014
18 2005 99 99 17057 13 9 2006
19 2005 99 99 16486 19 2 2005
20 2005 99 99 16548 22 4 2005
;
data want;
set have;
*calculate censor date;
censor_date=mdy(censor_month, censor_day, censor_year);
*handle missing values for both day and month;
if dx_month = 99 and dx_day = 99 then
do;
*first possible date is the beginning of the year;
first_possible_date=intnx('year', censor_date, 0, 'b');
*last possible date is the day before censoring;
last_possible_date=censor_date - 1;
*event date is a random date between these dates;
event_date=rand('integer', first_possible_date, last_possible_date);
*assign day and months as needed;
dx_month=month(event_date);
dx_day=day(event_date);
end;
else if dx_day eq 99 and dx_month ne 99 then
do;
*first possible date is beginning of month indicated;
first_possible_date=mdy(censor_month, 1, censor_year);
*last possible date is last of month;
*EXCEPT if month is same as censor month;
*in that case, use the earlier of the two dates;
last_possible_date=min(intnx('month', first_possible_date, 0, 'e'),
censor_date - 1);
*calculate a random event date;
event_date=rand('integer', first_possible_date, last_possible_date);
*assign day as needed;
dx_day=day(event_date);
end;
else
event_date=mdy(dx_month, dx_day, dx_year);
*format dates for display and legibility;
format censor event_date first_possible_date last_possible_date date9.;
check = censor_date - event_date;
run;
PS.....if this is going to be published I'm going to want my name on the paper 😉
Thank you Reeza! It's such a thorough approach!
Year of diagnosis is known for all records (dx_year). The goal is to impute missing month and/or missing day within the known year of diagnosis. Therefore, the first possible date is the beginning of the year of diagnosis instead the year of censoring.
Sorry for not including only-day missing scenario in the mock data. In the inclusive data, ID=3 was censored (03/26/2005) prior to a diagnosis (04/99/2005).
Below is the closest I got using your approach. Ignore, negative survival for ID=3. Please comment if you find another problem.
data have1; set have;
if dx_month = 99 and dx_day = 99 then fake_dx=mdy(1,1,dx_year);
else
if dx_day eq 99 and dx_month ne 99 then
fake_dx=mdy(dx_month,1,dx_year);
run;
data want;
set have1;
*calculate censor date;
censor_date=mdy(censor_month, censor_day, censor_year);
*handle missing values for both day and month;
if dx_month = 99 and dx_day = 99 then
do;
*first possible date is the beginning of the known year of diagnosis;
first_possible_date=intnx('year', fake_dx, 0, 'b');
*last possible date is the earlier of the end of the known year of diagnosis or censoring date;
last_possible_date=min(intnx('year', first_possible_date, 0, 'e'),
censor_date - 1);
*event date is a random date between these dates;
event_date=rand('integer', first_possible_date, last_possible_date);
*assign day and months as needed;
dx_month=month(event_date);
dx_day=day(event_date);
end;
else if dx_day eq 99 and dx_month ne 99 then
do;
*first possible date is beginning of the known month of diagnosis;
first_possible_date=mdy(dx_month, 1, dx_year);
*last possible date is last day of the of month;
*EXCEPT if month is same as censor month;
*in that case, use the earlier of the two dates;
last_possible_date=min(intnx('month', first_possible_date, 0, 'e'),
censor_date - 1);
*calculate a random event date;
event_date=rand('integer', first_possible_date, last_possible_date);
*assign day as needed;
dx_day=day(event_date);
end;
else
event_date=mdy(dx_month, dx_day, dx_year);
*format dates for display and legibility;
format censor event_date first_possible_date last_possible_date fake_dx date9.;
check = censor_date - event_date;
run;
proc print data=want(drop=censor_:);
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.