Solved: Perl-regular expressions in SAS to add imputations to date and highlig...

smackerz1988 · Posted 10-29-2022 09:39 AM

Hello,

I'm trying to use perl-regular expressions to impute missing datetime variables to eventually put in a macro as the imputed variable (e.g adtmc) is quite common in datasets. My approach works to a degree but I was wondering is it possible to further expand it to highlight uncommon data problems. Also depending on data raw datetime variables can sometimes contain a space or a T in between date and time. My current code only accounts for spacing but how can I account for both but make sure the T is removed and replaced by a space? I've provided sample data, desired output and the code I'm using currently to apply

data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01' '/*Space indicates a possible datetime*/
2020-01-02
2020-01
2020
junk
;;;;
run;


data want;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01 01:02
2020-01-01 01:XX
2020-01-01 01:XX
2020-01-01 XX:XX
2020-01-01 XX:XX 
2020-01-02 
2020-01-XX
2020-XX-XX
junk
;;;;
run;


data want;
  set have;
  length adtmc $16;
  array vals[3] $;
  vals[1]='XXXX';
  vals[2]='-XX';
  vals[3]='-XX';
  _rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
  _rc = prxmatch(_rx,ecdtc);   *this does the matching.  Probably should check for value of _rc to make sure it matched before continuing.;
  do _i = 1 to 4;   *now iterate through the four capture buffers;
    _rt = prxposn(_rx,_i,ecdtc);
    if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
    else timepart = _rt;  *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
  end;
  
  adtmc = cats(of vals[*]);  *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
  if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart);  *and append the timepart after.;
run;

PGStats · Posted 10-29-2022 04:58 PM

It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:

data want;
if not id then id + prxparse(
    "/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
    d = prxposn(id, 1, ecdtc);
    t = prxposn(id, 2, ecdtc);
    adtmc = catx(" ", 
        catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
        catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
    end;
else adtmc = ecdtc;
drop id d t;
run;

PG

View solution in original post

PGStats · Posted 10-29-2022 04:58 PM

It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:

data want;
if not id then id + prxparse(
    "/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
    d = prxposn(id, 1, ecdtc);
    t = prxposn(id, 2, ecdtc);
    adtmc = catx(" ", 
        catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
        catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
    end;
else adtmc = ecdtc;
drop id d t;
run;

PG

smackerz1988 · Posted 10-29-2022 05:22 PM

Very nice solution thanks!. Quick question does the id+prxparse have to be before the set statement?. Also does this pick up when no time is present and so will only impute the date and not the full datetime imputation if no timepart exists?

PGStats · Posted 10-29-2022 05:34 PM

No it doesn't. As long as it runs before PRXMATCH. In any case, you should make sure that the datastep variable names are not the same as the input dataset variable names.

PG

smackerz1988 · Posted 10-29-2022 05:37 PM

Is there a way to automate to check the length of the variable(in this case ecdtc) to see what imputation is required?. Basically I'm trying to put this into a macro and some date variables will have different imputation requirements based on if it is a date or datetime

smackerz1988 · Posted 10-29-2022 05:46 PM

Apologies I've updated my have/want datasets to illustrate my approach better

tarheel13 · Posted 10-29-2022 08:27 PM

you could use the length function on ecdtc and code the logic accordingly.

Ksharp · Posted 10-30-2022 05:57 AM

data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01
2020-01-02
2020-01
2020
junk
;;;;
run;

data want;
 set have;
 want='XXXX-XX-XX XX:XX';
 if anydigit(ecdtc)=1 then do;
   do i=1 to length(ecdtc);
     substr(want,i,1)=substr(translate(ecdtc,' ','T'),i,1);
   end;
 end;
 else want=ecdtc;
 drop i;
run;

Tom · Posted 10-30-2022 09:47 AM

2020-01-01' '/*Space indicates a possible datetime*/

Remember that SAS stores character strings as fixed length. There is no way to tell the difference between '2020-01-01' and '2020-01-01 ' once you have the value in a variable.

smackerz1988 · Posted 10-30-2022 10:36 AM

Thanks everyone. Some really useful approaches here!

Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

Re: Perl-regular expressions in SAS to add imputations to date and highlight data errors

SAS Innovate 2025: Call for Content

Classroom Training Available!