Hello,
I'm trying to use perl-regular expressions to impute missing datetime variables to eventually put in a macro as the imputed variable (e.g adtmc) is quite common in datasets. My approach works to a degree but I was wondering is it possible to further expand it to highlight uncommon data problems. Also depending on data raw datetime variables can sometimes contain a space or a T in between date and time. My current code only accounts for spacing but how can I account for both but make sure the T is removed and replaced by a space? I've provided sample data, desired output and the code I'm using currently to apply
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01' '/*Space indicates a possible datetime*/
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01 01:02
2020-01-01 01:XX
2020-01-01 01:XX
2020-01-01 XX:XX
2020-01-01 XX:XX
2020-01-02
2020-01-XX
2020-XX-XX
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;
It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:
data want;
if not id then id + prxparse(
"/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
d = prxposn(id, 1, ecdtc);
t = prxposn(id, 2, ecdtc);
adtmc = catx(" ",
catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
end;
else adtmc = ecdtc;
drop id d t;
run;
It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:
data want;
if not id then id + prxparse(
"/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
d = prxposn(id, 1, ecdtc);
t = prxposn(id, 2, ecdtc);
adtmc = catx(" ",
catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
end;
else adtmc = ecdtc;
drop id d t;
run;
Very nice solution thanks!. Quick question does the id+prxparse have to be before the set statement?. Also does this pick up when no time is present and so will only impute the date and not the full datetime imputation if no timepart exists?
No it doesn't. As long as it runs before PRXMATCH. In any case, you should make sure that the datastep variable names are not the same as the input dataset variable names.
Is there a way to automate to check the length of the variable(in this case ecdtc) to see what imputation is required?. Basically I'm trying to put this into a macro and some date variables will have different imputation requirements based on if it is a date or datetime
Apologies I've updated my have/want datasets to illustrate my approach better
you could use the length function on ecdtc and code the logic accordingly.
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
want='XXXX-XX-XX XX:XX';
if anydigit(ecdtc)=1 then do;
do i=1 to length(ecdtc);
substr(want,i,1)=substr(translate(ecdtc,' ','T'),i,1);
end;
end;
else want=ecdtc;
drop i;
run;
2020-01-01' '/*Space indicates a possible datetime*/
Remember that SAS stores character strings as fixed length. There is no way to tell the difference between '2020-01-01' and '2020-01-01 ' once you have the value in a variable.
Thanks everyone. Some really useful approaches here!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.