SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Impose additional constraints on PRXMATCH user-defined character class?

Accepted Solution Solved
Reply
Contributor SM1
Contributor
Posts: 51
Accepted Solution

Impose additional constraints on PRXMATCH user-defined character class?

[ Edited ]

Hello,

 

I'm developing syntax to QA a date/time string variable in an 11-character format.The time should be written according to a 24-hour clock (or as military time).

 

If written correctly, the value for 7 June 2017 at 10:30 am would be: 071030JUN17

The value for 7 June 2017 at 10:30 pm would be: 072230JUN17

 

I am not disaggregating the variable into separate time and date variables from the start for reasons that would involve a long explanation. But that option was considered and rejected.

 

Using the date/time value of 7 June 2017 at 10:30 am, here are examples of the most common errors:

 

1. time omitted = 07AUG17

2. leading zero of day omitted = 71030JUN17

3. leading zero of day and time both omitted = 7JUN17

4. June spelled out entirely = 071030JUNE17

5. year written as 4-digits = 071030JUN2017

6. day and time written in reverse order = 103007JUN17

 

and

 

7. 0 (zero) instead of O (letter O) in month: 0710300CT17 - only relevant for dates in October

 

I am trying to use PRXMATCH syntax to differentiate between values that match the correct value pattern and those that do not. Each observation has a unique ReportID. The date/time incident variable (IncDateTime) has no spaces between any of the (as many as) 11 characters/digits.

 

I ran variations on the following basic syntax:

 

data prxmatch_incident;

set incident (keep = ReportID IncDateTime);

position = prxmatch("/\JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC\d\d/",IncDateTime);

run;

 

This syntax comes close to doing what is needed. Unfortunately, the syntax recognizes the following values similarly and determines position = 3 for each value:

 

07JUN17

07JUN2017

07JUNE17

 

So I need to further limit the paramters of the pattern so that JUN is not the same as JUNE and 17 is not the same as 2017. If I can confidently distinguish between the values by pattern error, I can write syntax to further clean the data automatically.

 

Other "position = prxmatch" statements that were tried, but did not work include:

 

 

position_full = prxmatch("/\d\d\d\d\d\d\JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC\d\d/",inc);
position_date = prxmatch("/\d\d\[ADFJMNOS]\[ACEOPU]\[BCGLNPRTVY]\d\d/",inc);
position_month2 = prxmatch("/\[ADFJMNOS-ACEOPU-BCGLNPRTVY]\d\d/",inc);

 

Any suggestions to further refine the PRXMATCH statement will be most appreciated!

 

Thank you!


Accepted Solutions
Solution
‎08-22-2017 06:23 PM
Esteemed Advisor
Posts: 5,482

Re: Impose additional constraints on PRXMATCH user-defined character class?

You can push the validation quite far:

 

data test;
  input have_str :$20.;
  pos = prxmatch('/\b(0[1-9]|[12]\d|3[01])([01]\d|2[0-3])([0-5]\d)(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)(\d\d)\b/o',have_str);
datalines;
071030JUN17
072230JUN17
07AUG17
71030JUN17
7JUN17
071030JUNE17
071030JUN2017
103007JUN17
070020JUN17
002007JUN17
;

proc print noobs; run;
PG

View solution in original post


All Replies
PROC Star
Posts: 1,288

Re: Impose additional constraints on PRXMATCH user-defined character class?

Oh, I LOVE these problems!

 

If it were me, and if volume weren't a problem, I'd tackle it in stages.

 

Step 1: does my variable meet the format 999999AAA99. If it doesn't, reject it.

Step 2: okay, the broad format is reasonable. Now start looking at the pieces. Do characters 3 through 6 make sense as an hour and a minute? (3-4 between "00" and "24", 5-6 between "00" and "60"). If not, reject it.

Step 3: Are characters 7 through 9 one of your date values? Do characters 10 and 11 fall within your basket of acceptable year values? Are characters 1 to 2 between 01 and 31?

And finally, step 4: do the month and day together constitue a valid date?

 

For doing this kind of thing, the SAS functions are your friend. One of the secret sauce components of SAS is the incredibly rich set of functions that you can use.

 

Tom

Respected Advisor
Posts: 4,687

Re: Impose additional constraints on PRXMATCH user-defined character class?

[ Edited ]

@SM1

Something like below could do.

data test;
  infile datalines truncover;
  input have_str $20.;
  test= prxmatch('/^[0123]\d[012]\d[012345]\d(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\d\d$/',strip(have_str));
datalines;
071030JUN17
072230JUN17
07AUG17
71030JUN17
7JUN17
071030JUNE17
071030JUN2017
103007JUN17
run;
Contributor SM1
Contributor
Posts: 51

Re: Impose additional constraints on PRXMATCH user-defined character class?

@Patrick

 

Your solution also worked - thank you!

Solution
‎08-22-2017 06:23 PM
Esteemed Advisor
Posts: 5,482

Re: Impose additional constraints on PRXMATCH user-defined character class?

You can push the validation quite far:

 

data test;
  input have_str :$20.;
  pos = prxmatch('/\b(0[1-9]|[12]\d|3[01])([01]\d|2[0-3])([0-5]\d)(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)(\d\d)\b/o',have_str);
datalines;
071030JUN17
072230JUN17
07AUG17
71030JUN17
7JUN17
071030JUNE17
071030JUN2017
103007JUN17
070020JUN17
002007JUN17
;

proc print noobs; run;
PG
Contributor SM1
Contributor
Posts: 51

Re: Impose additional constraints on PRXMATCH user-defined character class?

@PG Stats

 

Your solution worked to clearly identify those variables that met the pattern of ######CCC## or not.

 

I tweaked my syntax with the tail end of your suggested syntax as follows:

 

data test;
  input have_str :$20.;
  pos = prxmatch('/(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)(\d\d)\b/o',have_str);
datalines;071030JUN17
072230JUN17
07AUG17
71030JUN17
7JUN17
071030JUNE17
071030JUN2017
103007JUN17
070020JUN17
002007JUN17;

 I'm using the new variable to act as a flag for the different types of errors (since POS = 0, 2, 3, 5, and 7) in the data.

 

Thanks for your help!

Super User
Posts: 13,338

Re: Impose additional constraints on PRXMATCH user-defined character class?

Sounds to me that either a data entry program is poorly written that allows garbage for entry ( this data has multiple fields day, month year and time but apparentlyis entered as one) or that some user training involved lashes with wet spaghetti noodles is in the offing.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 377 views
  • 2 likes
  • 5 in conversation