I have a data sources that contains a column (string) that is not consistent with respect to how it stores dates. Some are in the form mm/dd/yyyy, some are mm/yyyy and some are yyyy.
The anydtdte21. informat performs admirably, with the excpetion of just a year given. In that case, it returns a missing value.
How do I create an informat that tries anydtdte21. first, and if it fails assumes the input is in the format YYYY?
This code is used as an example
data _a;
length dtTXT $24.;
dtTxt = "03/01/2015"; output;
dtTxt = "02/2015"; output;
dtTXT = "2015"; output;
run;
data _b;
set _a;
format _dt _dt2 date.;
*** I would like to replace this code ***;
_dt = input(dtTxt, anydtdte21.);
if _dt = . then _dt = input("01/01/"||dtTxt, anydtdte21.);
**** REPLACE anydtdte21. here with a custome informat to achieve the same results as the above code ****;
_dt2 = input(dtTxt, anydtdte21.);
run;
proc print data=_b; run;
I would like to create an informat that correctly translates all three obs.
The B8601DA4. format will read four digit years as the first day of the year.
proc format;
invalue messyDate
"/^\d{1,2}\D\d{1,2}\D\d{4}/" (regexp) = [mmddyy10.]
"/^\d{4}/" (regexp) = [B8601DA.]
other = [anydtdte32.]
;
run;
You aren't going to find an informat that will handle all three. I've been somewhat curious about how SAS goes about implementing the Anydtdte and related informats as the tools we have for building informats with Proc Format don't seem possible to build them.
If you insist on a single line of code you could create a function that would contain your existing code to return a date value but I doubt the added complexity helps much.
If you only have 3 variations you may as well read it in as a character, check the length and use the appropriate format for the length.
if length(char_date) = 4 then date=mdy(1, 1, input(char_date, 4.));
else date=input(char_date, anydtdte.);
I really wish that my/our colleagues would stop saying that something can't be done. That has taken me away from my real work for more hours than I'd care to count.
Regardless, here is your custom informat:
data _a; length dtTXT $24.; dtTxt = "03/01/2015"; output; dtTxt = "02/2015"; output; dtTXT = "2015"; output; run; data infmt; retain fmtname "yearonly" type "I" ; do start=1900 to 2100; label = mdy(1,1,start); output ; end ; run;
proc format cntlin = infmt ; run ;
proc format; invalue dte 1900-3000=yearonly. other=anydtdte21.; run; data _b; set _a; format _dt _dt2 date.; *** I would like to replace this code ***; _dt = input(dtTxt, anydtdte21.); if _dt = . then _dt = input("01/01/"||dtTxt, anydtdte21.); **** REPLACE anydtdte21. here with a custom informat to achieve the same results as the above code ****; _dt2 = input(dtTxt, dte.); run;
Art, CEO, AnalystFinder.com
Actually, the anydtdte format does not perform "admirably", it results in wrong values for your dates:
data test;
input datestring :$21.;
date = input(datestring,anydtdte21.);
format date date9.;
cards;
03/01/2015
12/12/2017
12/10/2017
10/12/2017
10/13/2017
12/2017
2017
;
run;
proc print data=test noobs;
run;
Result:
datestring date 03/01/2015 03JAN2015 12/12/2017 12DEC2017 12/10/2017 12OCT2017 10/12/2017 10DEC2017 10/13/2017 13OCT2017 12/2017 01DEC2017 2017 .
As you can see, it only uses MMDDYY when the second digits exceed 12, otherwise you get DDMMYY.
I would never use that format because of such shortcomings. Instead expand shorter strings to your likings with clearly defined default values, and then use one matching format that only accepts one defined sequence. If you allow the computer to do your thinking for you, you'll end up with grief. Computers are dumb.
idea: Using regex in proc format and a function created with proc fcmp. The function is necessary because in a format defnition only functions with zero or one argument can be called and we need mdy().
EDIT: Fixed the format definition, still uses anydtdte - could be replaced by another function.
EDIT2: Fixed bad format of code.
data work.have;
length dtTXT $24.;
dtTxt = "03/01/2015"; output;
dtTxt = "02/2015"; output;
dtTXT = "2015"; output;
run;
proc fcmp outlib=work.funcs.dates;
function createDate(YearStr $);
return (mdy(1, 1, left(YearStr)));
endsub;
run;
options cmplib= work.funcs;
proc format;
invalue GetDate
"/^\d\d\/\d\d\/\d{4}/" (regexp) = [mmddyy10.]
"/^\d\d\/\d{4}/" (regexp) = [anydtdte.]
"/^\d{4}/" (regexp) = [createDate()]
other = 0
;
run;
data work.want;
set have;
format d Date9.;
put dtTxt=;
d = input(dtTxt, GetDate.);
run;
Thanks all for the feedback. I am trying to avoid using a user defned function, and want to accomplish this with informats.
I've combined several of the suggestions and have come up with this:
(Thanks @Kurt_Bremser for pointing out anydtdte's, ummm, "feature").
data infmt; retain fmtname "yyyy" type "I" ; do start=0001 to 9999; label = mdy(1,1,start); output ; end ; run; proc format cntlin = infmt ; run ; proc format; invalue messyDate "/^\d{1,2}\D\d{1,2}\D\d{4}/" (regexp) = [mmddyy10.] "/^\d{4}/" (regexp) = [yyyy.] other = [anydtdte32.] ; run; data test; infile datalines truncover dlm="|"; input datestring :$64.; format datestring $64.; formatedDate = input(datestring, messyDate.); format formatedDate date11.; cards; 12/12/2017 12/10/2017 10/12/2017 10/13/2017 03/01/2015 03-01-2015 03.01.2015
03 01 2015 3/1/2015 03/2015 3/2015 2017 1March2015 01March2015 1MAR2015 Mar 01 2015 MAR 01, 2015 March 1, 2015 March 1 2015 Mar 2015 March 2015 March, 2015 ; run; proc print data=test; run;
As you can see, I was a little coy regarding how messy the data are. This seems to work with the exception of the last two rows, i.e. when the full month is spelled out but no day is given.
The B8601DA4. format will read four digit years as the first day of the year.
proc format;
invalue messyDate
"/^\d{1,2}\D\d{1,2}\D\d{4}/" (regexp) = [mmddyy10.]
"/^\d{4}/" (regexp) = [B8601DA.]
other = [anydtdte32.]
;
run;
I Did not know you could regex your way through an informat. I was looking to solve a similar problem, but this is just what I need. Some of my SAS knowledge needs updating. I appreciated this example. I would have eventually found it in the docs, but I found it here!
Updating the informat slightly to include mmddyy regex without the slashes
proc format; invalue messyDate "Unknown" = . "/^\d{1,2}\D\d{1,2}\D\d{4}/" (regexp) = [mmddyy10.] "/^\d{1,2}\d{1,2}\d{4}/" (regexp) = [mmddyy10.] "/^\d{4}/" (regexp) = [B8601DA4.] other = [anydtdte32.] ; run;
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.