SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Working with incomplete dates

Accepted Solution Solved
Reply
Contributor
Posts: 48
Accepted Solution

Working with incomplete dates

[ Edited ]

Hello,

 

I was given a dataset where are the dates are originally in character format. To my dismay, I just discovered that not all the dates in my dataset are complete. While the majority (I hope) of the dates include a year, month, and day, some only include month, and some only include year, for example:

 

20070311

200812..

2009....

 

This is a really big problem for me since the dates comprise my study's desired outcome. In order to troubleshoot this, is there a way to convert dates in character form to formatted dates in three different forms (year+month+day, year+month, and year only)? That way if the date converts to a missing value in one format, I can crosscheck to see if it is because it is an incomplete date. Alternatively I was reading about ways to deal with incomplete dates in SAS but I am worried that if I don't understand the process, I will introduce mistakes into my dataset.

 

I tried something like this, but this code needs to at least be tweaked before it will work, and I am not sure it is the best solution:

 

data clean.eHARS;

set clean.eHARS;

 

    dx_dt_day=input(hiv_aids_dx_dt,YYMMDD10.);

    format dx_dt_day YYMMDD10.;

 

dx_dt_mon=input(hiv_aids_dx_dt,YYMMDD10.);

    format dx_dt_day MONYY.;

 

dx_dt_year=input(hiv_aids_dx_dt,YYMMDD10.);

    format dx_dt_day YEARw.;

;

run;

 

Any help would be very much appreciated.

 

Thanks so much in advance!


Accepted Solutions
Solution
‎01-04-2017 06:19 PM
Respected Advisor
Posts: 3,900

Re: Working with incomplete dates

@Kels123

Will this code still work if there are periods in place of the missing components of the dates?

 

For the code I've already posted: No, as you haven't told us that your data looks this way.

 

One way to go: Just remove anything from the string that's not a digit.

 date_str=compress(date_str,,'kd');

 

data sample;
  input pattern :$8. date_str :$8.;
  date_str=compress(date_str,,'kd');
  format date_derived date9.;
  length pattern_used $8;
  select(lengthn(date_str));
    when(4) 
      do;
        pattern_used='YYYY';
        date_derived=mdy('07','01',date_str);
      end;
/*    when(6) */
/*      do;*/
/*        pattern_used='YYYYMM';*/
/*        date_derived=input(date_str,yymmn6.)+14;*/
/*      end;*/
    when(6) 
      do;
        pattern_used='YYYYMM';
        date_derived=intnx('month',input(date_str,yymmn6.),0,'m');
      end;
    when(8) 
      do;
        pattern_used='YYYYMMDD';
        date_derived=input(date_str,yymmdd8.);
      end;
    otherwise;
  end;
  datalines;
yyyy 2016.....
yyyymm 201612--
yyyymm 201602??
yyyymm 201502
yyyymmdd 20161205
n/a 20
;
run;

View solution in original post


All Replies
Super User
Posts: 17,907

Re: Working with incomplete dates

There isn't a set way to deal with this type of scenario. Your going to have to write your own rules. Look for papers on cleaning dates on lexjansen.com, there will be lots of examples. 

 

Your second date is 2008 and a day of 28, with no month? Or something else? 

 

Is your data originally in Excel, it may have messed up your data. 

 

Also, can you work off other dates in your dataset? Your likely going to do some manual data cleaning. 

Respected Advisor
Posts: 3,900

Re: Working with incomplete dates

[ Edited ]

SAS stores dates internally as a numeric value which is the count of days since 1/1/1960; so it's always a specific day.

 

For this reason conversion of an incomplete date string into a SAS date value must result in a specific calendar date and you must make a few decisions before you can code for such a conversion.

 

Questions

- year only: would you want to map this to a date beginning, middle or end of year?

- year and month: beginning, middle or end of month?

- year and day: which month would you chose?

 

- A date string like 200810 is ambigous and could be YearMonth or YearDay. How would you want to treat it?

- Is there any other information in your data which could help us to further identify the date (i.e. a column with a month name)?

 

Once you've answered above questions it's no more hard to write the code which does the conversion.

PROC Star
Posts: 1,099

Re: Working with incomplete dates

With the three examples shown, it would be pretty simple to technically deal with them, but as the other folks have stated, you also need to figure out what to replace the missing years, months, and days with.

 

Some other questions to avoid problems later on:

 

If it's a four digit number, is it always a year, or could it be a:
- month and day
- two digit year and month or day

 

Is a six digit number always a four digit year followed by a two digit month? What other possibilities exist?

 

Are year values always four digit, and months and days always two digits? If months and days are ever coded as one digit, it's going to be EXTREMELY challenging.

 

How many entries do you have? This will decide the balance between automated and manual resolution.

  Tom

Contributor
Posts: 48

Re: Working with incomplete dates

Hi Tom,

 

After some thought and data exploration, I have answers to the questions that you posited:

 

Luckily, the date strings are either year only, year and month only, or full date, and they are consistently in the form (YYYY...., YYYYMM.., or YYYYMMDD). For dates with missing DAYS, I would like to code these as the middle of the month (e.g. YYYYMM.. turns into YYYYMM15). For dates with missing DAYS and MONTHS, I would like to code these as the middle of the year (e.g. YYYY0702).

 

Out of thousands of dates, there are several hundred incomplete dates. I have manually gone through to assess the situation and it looks like only about 10 of the incomplete dates will significantly influence my research project outcome So I could technically manually recode these dates, but it would probably be better to go about this in a systematic way.

 

Here is an example of the code I used to format the dates:

 

data clean.eHERS;

set clean.eHERS;

new_vl_date1=input(vl_date1,yymmdd10.);

    format new_vl_date1 yymmdd10.;

run;

 

vl_date1 = the date in character form (with varying levels of completeness depending on the subject).

new_vl_date1 = the date formatted as a date (all incomplete character dates turn into missing values).

 

Would you be able to send me some example code for what to do with these dates prior to formatting so that they don't turn into missing values? Does this code also include a way to format the dates or create an indicator variable for which dates are "assumed" /approximate dates? Through all of my data manipulations, I want to make sure that I can keep track of which dates are full dates and which dates are "assumed" dates.

 

Thank you so very much!

 

Kelsey

Contributor
Posts: 48

Re: Working with incomplete dates

Hi Patrick,

 

After some thought and data exploration, I have answers to your questions. 

 

- year only: would you want to map this to a date beginning, middle or end of year? MIDDLE OF THE YEAR

- year and month: beginning, middle or end of month? MIDDLE OF THE MONTH (the 15th)

- year and day: which month would you chose? These situations do not exist (my previous example was misleading and has been corrected).

- A date string like 200810 is ambigous and could be YearMonth or YearDay. How would you want to treat it? Luckily, the date strings are either year only, year and month only, or full date.

- Is there any other information in your data which could help us to further identify the date (i.e. a column with a month name)? Unfortunately no, but luckily almost all cases of incomplete dates are years with months already specified. So by choosing to set the days for these incomplete dates as the 15th, worst case scenario most of the dates will be "off" by a maximum of 15 days.

 

Here is the code I used to format the dates:

 

 

data clean.eHERS;

set clean.eHERS;

new_vl_date1=input(vl_date1,yymmdd10.);

    format new_vl_date1 yymmdd10.;

run;

 

vl_date1 = the date in character form (with varying levels of completeness depending on the subject).

new_vl_date1 = the date formatted as a date (all incomplete character dates turn into missing values).

 

Would you be able to send me some example code for what to do with these dates prior to formatting so that they don't turn into missing values? Does this code also include a way to format the dates or create an indicator variable for which dates are "assumed" /approximate dates? Through all of my data manipulations, I want to make sure that I can keep track of which dates are full dates and which dates are "assumed" dates.

 

Thank you so very much!

 

 

 

 

Respected Advisor
Posts: 3,900

Re: Working with incomplete dates

[ Edited ]

@Kels123

From how you describe your data it seems your lucky and we can simply use the length of the source string to determine how to convert the string to a SAS date value.

 

I've posted in below code two versions for calculation of "middle of the month". The version in comment just sets the date always to the 15th, the other version uses the intnx() calendar function and sets the date to whatever the function returns as middle of the month.

data sample;
  input pattern :$8. date_str :$8.;
  format date_derived date9.;
  length pattern_used $8;
  select(lengthn(date_str));
    when(4) 
      do;
        pattern_used='YYYY';
        date_derived=mdy('07','01',date_str);
      end;
/*    when(6) */
/*      do;*/
/*        pattern_used='YYYYMM';*/
/*        date_derived=input(date_str,yymmn6.)+14;*/
/*      end;*/
    when(6) 
      do;
        pattern_used='YYYYMM';
        date_derived=intnx('month',input(date_str,yymmn6.),0,'m');
      end;
    when(8) 
      do;
        pattern_used='YYYYMMDD';
        date_derived=input(date_str,yymmdd8.);
      end;
    otherwise;
  end;
  datalines;
yyyy 2016
yyyymm 201612
yyyymm 201602
yyyymm 201502
yyyymmdd 20161205
n/a 20
;
run;

 

Contributor
Posts: 48

Re: Working with incomplete dates

Thank you!  Will this code still work if there are periods in place of the missing components of the dates? For example, 2008.... or 200710..  ? That seems to be the case for my data.

 

 

 

Super User
Posts: 17,907

Re: Working with incomplete dates

Are you reading this from a text file?
Contributor
Posts: 48

Re: Working with incomplete dates

No - the file was given to me as a SAS file. It seems that incomplete date variables are indicated using periods when the days/months are missing. Thanks.

Super User
Posts: 17,907

Re: Working with incomplete dates

I would suggest going back in your process, perhaps to whomever created the file. One possibility is someone concatented fields To create the data field. If this is the case, if you have the raw data it would make the process easier. 

Contributor
Posts: 48

Re: Working with incomplete dates

That may be very difficult since I received the data through a complicated, one-time, de-identified data request. If I am stuck with the data in its current form, is there any way to get around this problem?

Solution
‎01-04-2017 06:19 PM
Respected Advisor
Posts: 3,900

Re: Working with incomplete dates

@Kels123

Will this code still work if there are periods in place of the missing components of the dates?

 

For the code I've already posted: No, as you haven't told us that your data looks this way.

 

One way to go: Just remove anything from the string that's not a digit.

 date_str=compress(date_str,,'kd');

 

data sample;
  input pattern :$8. date_str :$8.;
  date_str=compress(date_str,,'kd');
  format date_derived date9.;
  length pattern_used $8;
  select(lengthn(date_str));
    when(4) 
      do;
        pattern_used='YYYY';
        date_derived=mdy('07','01',date_str);
      end;
/*    when(6) */
/*      do;*/
/*        pattern_used='YYYYMM';*/
/*        date_derived=input(date_str,yymmn6.)+14;*/
/*      end;*/
    when(6) 
      do;
        pattern_used='YYYYMM';
        date_derived=intnx('month',input(date_str,yymmn6.),0,'m');
      end;
    when(8) 
      do;
        pattern_used='YYYYMMDD';
        date_derived=input(date_str,yymmdd8.);
      end;
    otherwise;
  end;
  datalines;
yyyy 2016.....
yyyymm 201612--
yyyymm 201602??
yyyymm 201502
yyyymmdd 20161205
n/a 20
;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 564 views
  • 1 like
  • 4 in conversation