09-11-2017 12:48 PM
This has been puzzling me for quite some time now. I'm working with some very dirty data that came to me in the form of 41 Excel spreadsheets. One of the issues with this data is that some of the date values look like this:
That 9/26/201310/7/13 looked like this in the original data file:
So what I'm trying to figure out is: is there a way to remove everything in a variable value EXCEPT the most recent date? That way all my single date values will remain the same, but all my multiple date values will keep only the most recent valid date.
Thanks so much for your time! I've tried searching for solutions to this but I think I'm just not using the right keywords.
09-11-2017 01:10 PM
@fwashburn please post the data as text, not an image. Otherwise each of us has to type out your data to work with it, and it's easier if you do that once.
09-11-2017 01:19 PM
First, I'd do a countw with the slash as separator. If the result is five, split the third word with substr: the first 4 characters are the year of the first date, and everything from position 5 onward is the month of the second date. You can then convert the parts to dates (input to convert to numeric and mdy to build the dates).
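For anyone who wants to see that logic outside of SAS, here is a minimal Python sketch of the same idea. It is only an illustration: the function names `split_fused_dates` and `make_date` are made up for this example, it assumes the year fused into the third word is always four digits, and it assumes any two-digit year belongs to the 2000s.

```python
from datetime import date


def make_date(month, day, year):
    """Build a date from string parts (the role mdy plays in SAS)."""
    y = int(year)
    if y < 100:
        y += 2000  # assumption: two-digit years are in the 2000s
    return date(y, int(month), int(day))


def split_fused_dates(value):
    """Split 'm/d/y' or a fused 'm/d/yyyym/d/y' value into a list of dates.

    Assumes the year fused into the third word always has four digits.
    """
    words = value.strip().split("/")  # same word count countw would give
    if len(words) == 3:
        return [make_date(*words)]
    if len(words) == 5:
        m1, d1, fused, d2, y2 = words
        y1, m2 = fused[:4], fused[4:]  # first 4 chars = year, rest = month
        return [make_date(m1, d1, y1), make_date(m2, d2, y2)]
    raise ValueError(f"unexpected number of words in {value!r}")


dates = split_fused_dates("9/26/201310/7/13")
most_recent = max(dates)
```

Taking `max` of the returned list keeps only the most recent date, and single-date values pass through unchanged.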
09-11-2017 02:45 PM
Is there absolute consistency? You show two variants:
1. Three words separated with slashes. When this happens, the field is always m/d/y, with each word a varying number of digits. This one is easy.
2. Five words separated by slashes. In this case, the field appears to be m/d/x/d/y, with each word a varying number of digits. Words 1,2,4, and 5 are easy. Word 3 is easy, AS LONG AS THE YEAR IS ALWAYS FOUR DIGITS! You have cases in the other fields of a two-digit year. This will cause you huge problems.
All I can think of is this (pseudo code)
IF length of field 3 is less than 3
    THEN *error* /* can't make a year and a month out of two or fewer digits. */
ELSE IF first two digits of field 3 are '20' /* has to be a four digit date, unless your years can include 2020 */
    IF length of field 3 is 5
        THEN take first four digits as year, fifth as month
    ELSE IF length of field 3 is 6
        THEN take first four digits as year, fifth and sixth as month
    ELSE *error* /* can't make any sense out of a 3-4 or 7 or above character field starting with '20' */
ELSE DO /* first two digits not 20, must be a two-digit year */
    IF length of field 3 is 3
        THEN take first two digits as year, third as month
    ELSE IF length of field 3 is 4
        THEN take first two digits as year, third and fourth as month
    ELSE *error* /* can't make any sense out of a 5 or above character field not starting with '20' */
But I'm highly doubtful. This looks like a huge unstructured data problem.
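The pseudocode above can be sketched in Python roughly as follows. This is a translation for illustration, not tested against the real data: `split_year_month` is a hypothetical name, two-digit years are assumed to be in the 2000s, and the month range check is an extra safeguard that the pseudocode does not mention.

```python
def split_year_month(word):
    """Split the fused third word (year + month) into (year, month).

    Follows the pseudocode: a word starting with '20' is treated as a
    four-digit year, anything else as a two-digit year.
    """
    if not word.isdigit() or len(word) < 3:
        raise ValueError(f"cannot split {word!r} into year and month")

    if word.startswith("20"):
        # Four-digit year followed by a one- or two-digit month.
        if len(word) not in (5, 6):
            raise ValueError(f"bad length for four-digit year: {word!r}")
        year, month = int(word[:4]), int(word[4:])
    else:
        # Two-digit year followed by a one- or two-digit month.
        if len(word) not in (3, 4):
            raise ValueError(f"bad length for two-digit year: {word!r}")
        year = 2000 + int(word[:2])  # assumption: years are in the 2000s
        month = int(word[2:])

    if not 1 <= month <= 12:  # extra check, not part of the pseudocode
        raise ValueError(f"month out of range in {word!r}")
    return year, month
```

As in the pseudocode, a two-digit year of 20 followed by a month (for example `2010`) is rejected rather than guessed at, so those rows would need manual review.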
09-11-2017 05:05 PM
You are correct; this is a huge unstructured data problem. I've been racking my brain trying to figure out if there's any code that would force the values into uniformity, but I'm starting to think it may not be possible. I'll try out your code and report back; thanks for your response!
09-11-2017 05:13 PM
Can you do a mass find and replace in the Excel file for all CTRL+ENTER values and change them to an asterisk or some other symbol that can be used to parse the data later on?
09-11-2017 02:48 PM
If the Excel file had a CTRL+ENTER in the file and you convert it to CSV and read it in, I believe the return is kept.
That would help parse the data and avoid other issues...possibly.
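If a separator does survive (either the original line break or a replacement symbol such as an asterisk), picking the latest date becomes straightforward. A minimal Python sketch, assuming every piece is a complete m/d/y date with a two- or four-digit year; `latest_date` is a hypothetical helper name:

```python
import re
from datetime import date, datetime


def latest_date(cell) -> date:
    """Return the most recent date in a cell holding one or more dates.

    Dates are assumed to be separated by line breaks or asterisks.
    """
    pieces = [p.strip() for p in re.split(r"[\r\n*]+", cell) if p.strip()]
    if not pieces:
        raise ValueError("no date found in cell")

    dates = []
    for piece in pieces:
        # Try a four-digit year first, then fall back to a two-digit year.
        for fmt in ("%m/%d/%Y", "%m/%d/%y"):
            try:
                dates.append(datetime.strptime(piece, fmt).date())
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"unrecognized date: {piece!r}")
    return max(dates)
```

Note that `%y` maps two-digit years 00-68 to 2000-2068, so a value like `10/7/13` is read as 2013.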
09-11-2017 05:07 PM
Unfortunately when I convert the file to a CSV format, it screws with a lot of the other data, so I haven't been able to do that successfully. Otherwise, that would be a great call!