I'm working with a disease registry with "day" variable with around 50% missing. I'd like to impute for a missing half based on the frequency distribution pattern of non-missing half (values 1-31). Please see SAS proc freq on "day" variable.
Any hints or suggestions are appreciated. I have no prior experience of data imputation. Thanks in advance. Attached is the data in txt format.
proc import datafile=".......\have.txt"
out=have
dbms=tab replace;
getnames=yes;
guessingrows=max;
run;
data n.days;
input day freq percent;
cards;
1 12032 2.78
2 6933 1.60
3 6590 1.52
4 6157 1.42
5 6682 1.54
6 6812 1.57
07 6971 1.61
08 7017 1.62
09 6863 1.59
10 7288 1.68
11 7113 1.64
12 7079 1.64
13 7061 1.63
14 6954 1.61
15 11267 2.60
16 6936 1.60
17 6810 1.57
18 6827 1.58
19 6859 1.59
20 6996 1.62
21 6803 1.57
22 6731 1.56
23 6748 1.56
24 6452 1.49
25 6297 1.46
26 6282 1.45
27 6448 1.49
28 6579 1.52
29 6092 1.41
30 6235 1.44
31 3642 0.84
;
Do you have any idea why nearly half of the values are missing? That might give you some clue such as "day" was supposed to be recorded but "Person X" was out of the office and it didn't get recorded. (Look for person X missing dates in another set and calculate "daY"). Or could be that less than a full "day" elapsed before what ever is measured occurred?
If that is supposed to be a "day of the month" value then you have restrictions based on a MONTH value with a maximum possible day value of 28, 29, 30 or 31 days. And the distribution isn't very reliable without that information. Which is likely why "31" is roughly half the value of any other given day except 1 and 15 (whose frequencies are enough larger than for others to tell me that a fair percentage of those records are guesstimate days already).
Thanks for asking! Missing in day of month decreased as quality of registry evolved over time. Day of month wasn't a required field until 1996. Good catch for that frequency for 1,15 and 31 are abnormal. I was wondering about to assign random numbers 1-31 to Blanks simplicity sake. The resulting data will be used for age-period-cohort. Lexis triangle approach required date variable authentic as possible.
Year_of_diagnosis | Day_of_diagnosis | Freq |
1976 | Blank(s) | 10916 |
1977 | Blank(s) | 11108 |
1978 | Blank(s) | 11086 |
1979 | Blank(s) | 11231 |
1980 | Blank(s) | 11593 |
1981 | Blank(s) | 11605 |
1982 | Blank(s) | 11611 |
1983 | Blank(s) | 11652 |
1984 | Blank(s) | 11655 |
1985 | Blank(s) | 11961 |
1986 | Blank(s) | 11772 |
1987 | Blank(s) | 11207 |
1988 | Blank(s) | 10875 |
1989 | Blank(s) | 10572 |
1990 | Blank(s) | 10410 |
1991 | Blank(s) | 10332 |
1992 | Blank(s) | 9884 |
1993 | Blank(s) | 8623 |
1994 | Blank(s) | 6623 |
1995 | Blank(s) | 3222 |
1996 | Blank(s) | 975 |
1997 | Blank(s) | 688 |
1998 | Blank(s) | 455 |
1999 | Blank(s) | 456 |
2000 | Blank(s) | 508 |
2001 | Blank(s) | 476 |
2002 | Blank(s) | 632 |
2003 | Blank(s) | 519 |
2004 | Blank(s) | 432 |
2005 | Blank(s) | 364 |
2006 | Blank(s) | 285 |
2007 | Blank(s) | 315 |
2008 | Blank(s) | 436 |
2009 | Blank(s) | 476 |
2010 | Blank(s) | 472 |
2011 | Blank(s) | 432 |
2012 | Blank(s) | 367 |
2013 | Blank(s) | 321 |
2014 | Blank(s) | 255 |
2015 | Blank(s) | 208 |
Make sure to account for weekdays vs weekends and holidays.
Why? SAS has the HOLIDAY() and WEEKDAY() functions so you can determine holidays and weekdays relatively easily.
Weekdays/ends is relatively easily for sure.
@Reeza wrote:
Why? SAS has the HOLIDAY() and WEEKDAY() functions so you can determine holidays and weekdays relatively easily.
Weekdays/ends is relatively easily for sure.
Excellent thought. Many organizations have issues with the "put off to Monday" or the person that knows what to do does not work evenings, holidays or weekends...
For @Cruise is there another date information attached to these records? One might think that since you have a "day_of_diagnosis" that there could be something related to either confirmation testing, treatment start or admittance. Which might give you an upper bound for use. And if there is something like a distribution between diagnosis and the test/treatment/admittance then you might have a model there.
Also would there be some other systemic data such as department, clinic, city available? It might be there is a fairly consistent delay based on one or more of those factors to use. With the age of some of your data you may also want to consider the actual year as transmission delays have (mostly) gone down with implementation of electronic messaging or records vs having to ship paper around (1976 to early 1990's at least).
If this were my data and I did have one or more other dates to compare I would calculate differences between the diagnosis date and those others and then run an analysis on those intervals by year and (department, hospital what ever). Then use the appropriate mean interval to back up to a presumed diagnostic date. Which might be adjusted by Holiday and weekday results if appropriate.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.