BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
smackerz1988
Pyrite | Level 9

Hello,

 

I'm trying to use perl-regular expressions to impute missing datetime variables to eventually put in a macro as the imputed variable (e.g adtmc) is quite common in datasets. My approach works to a degree but I was wondering is it possible to further expand it to highlight uncommon data problems. Also depending on data raw datetime variables can sometimes contain a space or a T in between date and time. My current code only accounts for spacing but how can I account for both but make sure the T is removed and replaced by a space? I've provided sample data, desired output and the code I'm using currently to apply 

data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01' '/*Space indicates a possible datetime*/
2020-01-02
2020-01
2020
junk
;;;;
run;


data want;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01 01:02
2020-01-01 01:XX
2020-01-01 01:XX
2020-01-01 XX:XX
2020-01-01 XX:XX 
2020-01-02 
2020-01-XX
2020-XX-XX
junk
;;;;
run;


data want;
  set have;
  length adtmc $16;
  array vals[3] $;
  vals[1]='XXXX';
  vals[2]='-XX';
  vals[3]='-XX';
  _rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
  _rc = prxmatch(_rx,ecdtc);   *this does the matching.  Probably should check for value of _rc to make sure it matched before continuing.;
  do _i = 1 to 4;   *now iterate through the four capture buffers;
    _rt = prxposn(_rx,_i,ecdtc);
    if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
    else timepart = _rt;  *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
  end;
  
  adtmc = cats(of vals[*]);  *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
  if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart);  *and append the timepart after.;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:

 

data want;
if not id then id + prxparse(
    "/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
    d = prxposn(id, 1, ecdtc);
    t = prxposn(id, 2, ecdtc);
    adtmc = catx(" ", 
        catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
        catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
    end;
else adtmc = ecdtc;
drop id d t;
run;

PGStats_0-1667076982354.png

 

PG

View solution in original post

9 REPLIES 9
PGStats
Opal | Level 21

It's always a tradeoff between validation strength and complexity, This should filter out many anomalies:

 

data want;
if not id then id + prxparse(
    "/([12][90]\d\d-[01]\d-[0123]\d|[12][90]\d\d-[01]\d|[12][90]\d\d)[ T]([012]\d:[012345]\d|[012]\d:)?/io");
set have;
length adtmc $16;
if prxmatch(id, ecdtc) then do;
    d = prxposn(id, 1, ecdtc);
    t = prxposn(id, 2, ecdtc);
    adtmc = catx(" ", 
        catx("-", scan(d,1), coalescec(scan(d,2), "XX"), coalescec(scan(d,3), "XX")),
        catx(":", coalescec(scan(t,1,":"), "XX"), coalescec(scan(t,2,":"), "XX")) );
    end;
else adtmc = ecdtc;
drop id d t;
run;

PGStats_0-1667076982354.png

 

PG
smackerz1988
Pyrite | Level 9

Very nice solution thanks!. Quick question does the id+prxparse have to be before the set statement?. Also does this pick up when no time is present and so will only impute the date and not the full datetime imputation if no timepart exists?

PGStats
Opal | Level 21

No it doesn't. As long as it runs before PRXMATCH. In any case, you should make sure that the datastep variable names are not the same as the input dataset variable names.

PG
smackerz1988
Pyrite | Level 9

Is there a way to automate to check the length of the variable(in this case ecdtc) to see what imputation is required?. Basically I'm trying to put this into a macro and some date variables will have different imputation requirements based on if it is a date or datetime

smackerz1988
Pyrite | Level 9

Apologies I've updated my have/want datasets to illustrate my approach better

tarheel13
Rhodochrosite | Level 12

you could use the length function on ecdtc and code the logic accordingly. 

Ksharp
Super User
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-01T01:02
2020-01-01T01:
2020-01-01T01:89
2020-01-01T
2020-01-01
2020-01-02
2020-01
2020
junk
;;;;
run;

data want;
 set have;
 want='XXXX-XX-XX XX:XX';
 if anydigit(ecdtc)=1 then do;
   do i=1 to length(ecdtc);
     substr(want,i,1)=substr(translate(ecdtc,' ','T'),i,1);
   end;
 end;
 else want=ecdtc;
 drop i;
run;
Tom
Super User Tom
Super User

2020-01-01' '/*Space indicates a possible datetime*/

 

Remember that SAS stores character strings as fixed length.  There is no way to tell the difference between '2020-01-01' and '2020-01-01     ' once you have the value in a variable.

smackerz1988
Pyrite | Level 9

Thanks everyone. Some really useful approaches here!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1944 views
  • 5 likes
  • 5 in conversation