BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
smackerz1988
Quartz | Level 8

Hello,

 

I'm trying to use perl-regular expressions to impute missing datetime variables to eventually put in a macro as the imputed variable (e.g adtmc) is quite common in datasets. My approach works to a degree but I was wondering is it possible to further expand it to highlight uncommon data problems. Also depending on data raw datetime variables can sometimes contain a space or a T in between date and time. My current code only accounts for spacing but how can I account for both but make sure the T is removed and replaced by a space? I've provided sample data, desired output and the code I'm using currently to apply 

data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01' '/*Space indicates a possible datetime*/
2020-01-02
2020-01
2020
junk
;;;;
run;


data want;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01 01:02
2020-01-01 01:XX
2020-01-01 01:XX
2020-01-01 XX:XX
2020-01-01 XX:XX 
2020-01-02 
2020-01-XX
2020-XX-XX
junk
;;;;
run;


data want;
  set have;
  length adtmc $16;
  array vals[3] $;
  vals[1]='XXXX';
  vals[2]='-XX';
  vals[3]='-XX';
  _rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
  _rc = prxmatch(_rx,ecdtc);   *this does the matching.  Probably should check for value of _rc to make sure it matched before continuing.;
  do _i = 1 to 4;   *now iterate through the four capture buffers;
    _rt = prxposn(_rx,_i,ecdtc);
    if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
    else timepart = _rt;  *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
  end;
  
  adtmc = cats(of vals[*]);  *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
  if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart);  *and append the timepart after.;
run;
0 Likes
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Tourmaline | Level 20

It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:

 

data want;
if not id then id + prxparse(
    "/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
    d = prxposn(id, 1, ecdtc);
    t = prxposn(id, 2, ecdtc);
    adtmc = catx(" ", 
        catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
        catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
    end;
else adtmc = ecdtc;
drop id d t;
run;

PGStats_0-1667076982354.png

 

PG

View solution in original post

9 REPLIES 9
PGStats
Tourmaline | Level 20

It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:

 

data want;
if not id then id + prxparse(
    "/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
    d = prxposn(id, 1, ecdtc);
    t = prxposn(id, 2, ecdtc);
    adtmc = catx(" ", 
        catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
        catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
    end;
else adtmc = ecdtc;
drop id d t;
run;

PGStats_0-1667076982354.png

 

PG
smackerz1988
Quartz | Level 8

Very nice solution thanks!. Quick question does the id+prxparse have to be before the set statement?. Also does this pick up when no time is present and so will only impute the date and not the full datetime imputation if no timepart exists?

0 Likes
PGStats
Tourmaline | Level 20

No it doesn't. As long as it runs before PRXMATCH. In any case, you should make sure that the datastep variable names are not the same as the input dataset variable names.

PG
0 Likes
smackerz1988
Quartz | Level 8

Is there a way to automate to check the length of the variable(in this case ecdtc) to see what imputation is required?. Basically I'm trying to put this into a macro and some date variables will have different imputation requirements based on if it is a date or datetime

0 Likes
smackerz1988
Quartz | Level 8

Apologies I've updated my have/want datasets to illustrate my approach better

0 Likes
tarheel13
Barite | Level 11

you could use the length function on ecdtc and code the logic accordingly. 

0 Likes
Ksharp
Diamond | Level 26
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01
2020-01-02
2020-01
2020
junk
;;;;
run;

data want;
 set have;
 want='XXXX-XX-XX XX:XX';
 if anydigit(ecdtc)=1 then do;
   do i=1 to length(ecdtc);
     substr(want,i,1)=substr(translate(ecdtc,' ','T'),i,1);
   end;
 end;
 else want=ecdtc;
 drop i;
run;
Tom
Super User Tom
Super User

2020-01-01' '/*Space indicates a possible datetime*/

 

Remember that SAS stores character strings as fixed length.  There is no way to tell the difference between '2020-01-01' and '2020-01-01     ' once you have the value in a variable.

smackerz1988
Quartz | Level 8

Thanks everyone. Some really useful approaches here!

0 Likes
JUST RELEASED

SAS Viya with pay-as-you-go pricing.

Deploy software automatically at the click of a button on the Microsoft Azure Marketplace.

Learn more

55 percent discount.jpg

Recommended by SAS
These recommendations are generated using AI from SAS. For personalized recommendations, sign in with your SAS profile.
Discussion stats
  • 9 replies
  • 1008 views
  • 5 likes
  • 5 in conversation