Hello,
I'm trying to use Perl regular expressions to impute the missing parts of datetime variables. I'd eventually like to put this in a macro, since the imputed variable (e.g. adtmc) is quite common in datasets. My approach works to a degree, but I was wondering whether it's possible to expand it so that it also highlights uncommon data problems. Also, depending on the data, raw datetime values can contain either a space or a T between the date and the time. My current code only accounts for the space. How can I account for both while making sure the T is removed and replaced by a space? I've provided sample data, the desired output, and the code I'm currently using below.
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01' '/*Space indicates a possible datetime*/
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01 01:02
2020-01-01 01:XX
2020-01-01 01:XX
2020-01-01 XX:XX
2020-01-01 XX:XX
2020-01-02
2020-01-XX
2020-XX-XX
junk
;;;;
run;
data want;
  set have;
  length adtmc $16;
  array vals[3] $;
  vals[1]='XXXX';
  vals[2]='-XX';
  vals[3]='-XX';
  _rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
  _rc = prxmatch(_rx,ecdtc); *this does the matching and returns 0 when the pattern is not found.;
  if _rc then do; *only read the capture buffers after a successful match;
    do _i = 1 to 4; *now iterate through the four capture buffers;
      _rt = prxposn(_rx,_i,ecdtc);
      if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
      else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
    end;
    adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
    if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
  end;
  else adtmc = ecdtc; *no match (e.g. junk), so the raw value is kept as is;
run;
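
One possible way to accept either separator is sketched below. It is a rough sketch under stated assumptions, not a tested solution. The dataset name `sketch` and all `_`-prefixed variables are hypothetical names chosen for illustration. The pattern is anchored, the separator is matched as `T` or as a space that is followed by something, and the time is rebuilt with `catx(' ', ...)`, so a `T` never reaches the output. Treating an hour above 23 or a minute above 59 as `XX` is an assumption about what should count as invalid. Because a fixed-length character variable is padded with blanks, a value such as `2020-01-01` followed by one trailing space cannot be told apart from `2020-01-01` at this point, so that case would need to be detected when the raw text is read.

```sas
/* Hypothetical sketch: SKETCH and the _-prefixed names are illustrative only */
data sketch;
  set have;
  length adtmc $16 _yyyy $4 _mo _dd _hh _mi $2;
  retain _rx;

  /* Compile once. Groups: 1=year 2=month 3=day 4=separator 5=hour 6=minute.  */
  /* (T| (?=\S)) accepts T, or a space only when a non-blank follows it, so   */
  /* the blanks that pad ECDTC are not mistaken for a date/time separator.    */
  if _n_ = 1 then _rx = prxparse(
    '/^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?(?:(T| (?=\S))(\d{2})?(?::(\d{2})?)?)?\s*$/');

  if prxmatch(_rx, ecdtc) then do;
    /* Date part: missing month or day becomes XX */
    _yyyy = prxposn(_rx, 1, ecdtc);
    _mo   = coalescec(prxposn(_rx, 2, ecdtc), 'XX');
    _dd   = coalescec(prxposn(_rx, 3, ecdtc), 'XX');
    adtmc = catx('-', _yyyy, _mo, _dd);

    /* CALL PRXPOSN returns position 0 when the separator group did not match */
    call prxposn(_rx, 4, _pos, _len);
    if _pos > 0 then do;
      _hh = prxposn(_rx, 5, ecdtc);
      _mi = prxposn(_rx, 6, ecdtc);
      /* Assumption: blank or out-of-range components are imputed as XX */
      if _hh = ' ' or _hh > '23' then _hh = 'XX';
      if _mi = ' ' or _mi > '59' then _mi = 'XX';
      adtmc = catx(' ', adtmc, catx(':', _hh, _mi));
    end;
  end;
  else adtmc = ecdtc; /* no match (e.g. junk): keep the raw value */

  drop _:;
run;
```

With the sample data, this is meant to turn `2020-01-01T01:89` into `2020-01-01 01:XX` and `2020-01-01T` into `2020-01-01 XX:XX`, while `junk` passes through unchanged. Month and day ranges are not validated here.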