BookmarkSubscribeRSS Feed
Cruise
Ammonite | Level 13

I'm working with a disease registry with "day" variable with around 50% missing. I'd like to impute for a missing half based on the frequency distribution pattern of non-missing half (values 1-31). Please see SAS proc freq on "day" variable.

 

Any hints or suggestions are appreciated. I have no prior experience of data imputation. Thanks in advance. Attached is the data in txt format. sas_help.jpg

 

 

proc import datafile=".......\have.txt"
out=have
dbms=tab replace;
getnames=yes;
guessingrows=max;
run;

data n.days; 
input day freq percent; 
cards;
1 12032 2.78 
2 6933 1.60  
3 6590 1.52  
4 6157 1.42  
5 6682 1.54  
6 6812 1.57  
07 6971 1.61 
08 7017 1.62 
09 6863 1.59 
10 7288 1.68 
11 7113 1.64 
12 7079 1.64 
13 7061 1.63 
14 6954 1.61 
15 11267 2.60
16 6936 1.60 
17 6810 1.57 
18 6827 1.58 
19 6859 1.59 
20 6996 1.62 
21 6803 1.57 
22 6731 1.56 
23 6748 1.56 
24 6452 1.49 
25 6297 1.46 
26 6282 1.45 
27 6448 1.49 
28 6579 1.52 
29 6092 1.41 
30 6235 1.44 
31 3642 0.84 
;
6 REPLIES 6
ballardw
Super User

Do you have any idea why nearly half of the values are missing? That might give you some clue such as "day" was supposed to be recorded but "Person X" was out of the office and it didn't get recorded. (Look for person X missing dates in another set and calculate "daY"). Or could be that less than a full "day" elapsed before what ever is measured occurred?

 

If that is supposed to be a "day of the month" value then you have restrictions based on a MONTH value with a maximum possible day value of 28, 29, 30 or 31 days. And the distribution isn't very reliable without that information. Which is likely why "31" is roughly half the value of any other given day except 1 and 15 (whose frequencies are enough larger than for others to tell me that a fair percentage of those records are guesstimate days already).

 

 

Cruise
Ammonite | Level 13

@ballardw

Thanks for asking! Missing in day of month decreased as quality of registry evolved over time. Day of month wasn't a required field until 1996. Good catch for that frequency for 1,15 and 31 are abnormal. I was wondering about to assign random numbers 1-31 to Blanks simplicity sake. The resulting data will be used for age-period-cohort. Lexis triangle approach required date variable authentic as possible.    

 

Year_of_diagnosis Day_of_diagnosis Freq
1976 Blank(s) 10916
1977 Blank(s) 11108
1978 Blank(s) 11086
1979 Blank(s) 11231
1980 Blank(s) 11593
1981 Blank(s) 11605
1982 Blank(s) 11611
1983 Blank(s) 11652
1984 Blank(s) 11655
1985 Blank(s) 11961
1986 Blank(s) 11772
1987 Blank(s) 11207
1988 Blank(s) 10875
1989 Blank(s) 10572
1990 Blank(s) 10410
1991 Blank(s) 10332
1992 Blank(s) 9884
1993 Blank(s) 8623
1994 Blank(s) 6623
1995 Blank(s) 3222
1996 Blank(s) 975
1997 Blank(s) 688
1998 Blank(s) 455
1999 Blank(s) 456
2000 Blank(s) 508
2001 Blank(s) 476
2002 Blank(s) 632
2003 Blank(s) 519
2004 Blank(s) 432
2005 Blank(s) 364
2006 Blank(s) 285
2007 Blank(s) 315
2008 Blank(s) 436
2009 Blank(s) 476
2010 Blank(s) 472
2011 Blank(s) 432
2012 Blank(s) 367
2013 Blank(s) 321
2014 Blank(s) 255
2015 Blank(s) 208
Reeza
Super User

Make sure to account for weekdays vs weekends and holidays. 

 

Cruise
Ammonite | Level 13
Well thought. However, my data spans all the way from 1976-2015. Considering weekends and holidays historically would make it a whole another story.
Reeza
Super User

Why? SAS has the HOLIDAY() and WEEKDAY() functions so you can determine holidays and weekdays relatively easily. 

Weekdays/ends is relatively easily for sure. 

ballardw
Super User

@Reeza wrote:

Why? SAS has the HOLIDAY() and WEEKDAY() functions so you can determine holidays and weekdays relatively easily. 

Weekdays/ends is relatively easily for sure. 


Excellent thought. Many organizations have issues with the "put off to Monday" or the person that knows what to do does not work evenings, holidays or weekends...

 

For @Cruise is there another date information attached to these records? One might think that since you have a "day_of_diagnosis" that there could be something related to either confirmation testing, treatment start or admittance. Which might give you an upper bound for use. And if there is something like a distribution between diagnosis and the test/treatment/admittance then you might have a model there.

Also would there be some other systemic data such as department, clinic, city available? It might be there is a fairly consistent delay based on one or more of those factors to use. With the age of some of your data you may also want to consider the actual year as transmission delays have (mostly) gone down with implementation of electronic messaging or records vs having to ship paper around (1976 to early 1990's at least).

 

If this were my data and I did have one or more other dates to compare I would calculate differences between the diagnosis date and those others and then run an analysis on those intervals by year and (department, hospital what ever). Then use the appropriate mean interval to back up to a presumed diagnostic date. Which might be adjusted by Holiday and weekday results if appropriate.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 663 views
  • 0 likes
  • 3 in conversation