Howdy folks!
This has been puzzling me for quite some time now. I'm working with some very dirty data that came to me in the form of 41 Excel spreadsheets. One of the issues with this data is that some of the date values look like this:
That 9/26/201310/7/13 looked like this in the original data file:
So what I'm trying to figure out is - is there a way to remove everything in a variable value EXCEPT for the most recent date? That way all my single date values will remain the same, but all my multiple date values will only keep the most current and valid date.
Thanks so much for your time! I've tried searching for solutions to this but I think I'm just not using the right keywords.
Can you please type the values as text rather than pasting an image? I'd rather not retype them into my SAS editor before writing the program.
@fwashburn please post the data as text, not an image. We'd have to type out your data to work with it, but it's easier if you do that instead of us 🙂
First, I'd run countw with the slash as the separator. If it returns five, split the third word: its first 4 characters (substr) are the year of the first date, and the part from position 5 onward is the month of the second date. You can then convert the parts to numbers with input and build the dates with mdy.
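The countw/substr idea above could be sketched roughly like this. It is only a sketch under assumptions: the variable name raw_date and the two sample rows are made up for illustration, and it assumes the fused third word always begins with a four-digit year.

```sas
/* Sketch only: RAW_DATE and the sample rows are hypothetical.            */
/* Assumes the fused third word always starts with a four-digit year.     */
data have;
  input raw_date :$40.;
  datalines;
9/26/2013
9/26/201310/7/13
;
run;

data want;
  set have;
  length word3 $12;
  format date1 date2 clean_date mmddyy10.;

  n_words = countw(raw_date, '/');

  if n_words = 3 then
    /* a single, ordinary m/d/y value */
    clean_date = input(raw_date, ?? mmddyy10.);
  else if n_words = 5 then do;
    word3 = scan(raw_date, 3, '/');

    /* first date: word 1 / word 2 / first four characters of word 3 */
    date1 = mdy(input(scan(raw_date, 1, '/'), ?? 8.),
                input(scan(raw_date, 2, '/'), ?? 8.),
                input(substr(word3, 1, 4), ?? 8.));

    /* second date: rest of word 3 / word 4 / word 5 */
    date2 = mdy(input(substr(word3, 5), ?? 8.),
                input(scan(raw_date, 4, '/'), ?? 8.),
                input(scan(raw_date, 5, '/'), ?? 8.));

    /* keep the most recent date; MAX ignores missing values */
    clean_date = max(date1, date2);
  end;
run;
```

With the default YEARCUTOFF setting, the second sample row should come out as 10/07/2013, since mdy reads the two-digit year 13 as 2013. Any other word count leaves clean_date missing, so those rows can be reviewed by hand.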
Is there absolute consistency? You show two variants:
1. Three words separated by slashes. When this happens, the field is always m/d/y, with each word a varying number of digits. This one is easy.
2. Five words separated by slashes. In this case, the field appears to be m/d/x/d/y, with each word a varying number of digits. Words 1, 2, 4, and 5 are easy. Word 3 is easy, AS LONG AS THE YEAR IS ALWAYS FOUR DIGITS! You have cases in the other fields of a two-digit year. This will cause you huge problems.
All I can think of is this (pseudocode):
IF length of field 3 is less than 3
  THEN *error* /* can't make a year and a month out of two or fewer digits */
ELSE IF first two digits of field 3 are '20' /* has to be a four-digit year, unless your two-digit years can include 20 (i.e. 2020) */
  THEN DO
    IF length of field 3 is 5
      THEN take first four digits as year, fifth as month
    ELSE IF length of field 3 is 6
      THEN take first four digits as year, fifth and sixth as month
    ELSE *error* /* can't make any sense out of a field of 3-4 or 7 or more characters starting with '20' */
  END
ELSE DO /* first two digits not '20', must be a two-digit year */
    IF length of field 3 is 3
      THEN take first two digits as year, third as month
    ELSE IF length of field 3 is 4
      THEN take first two digits as year, third and fourth as month
    ELSE *error* /* can't make any sense out of a field of 5 or more characters not starting with '20' */
  END
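For what it's worth, the pseudocode above might translate into a SAS data step along these lines. This is a sketch, not tested against the real spreadsheets: the variable names (word3, note, year, month) and the sample values are invented, and the final range check on the month is an extra safeguard that is not part of the pseudocode.

```sas
/* Sketch only: WORD3, NOTE, YEAR, MONTH and the sample rows are hypothetical. */
data split_word3;
  input word3 :$12.;
  length note $40;

  len = length(word3);

  if len < 3 then note = 'error: too short';
  else if substr(word3, 1, 2) = '20' then do;
    /* treated as a four-digit year followed by a 1- or 2-digit month */
    if len in (5, 6) then do;
      year  = input(substr(word3, 1, 4), ?? 8.);
      month = input(substr(word3, 5), ?? 8.);
    end;
    else note = 'error: unexpected length after 20';
  end;
  else do;
    /* treated as a two-digit year followed by a 1- or 2-digit month */
    if len in (3, 4) then do;
      year  = input(substr(word3, 1, 2), ?? 8.);
      month = input(substr(word3, 3), ?? 8.);
    end;
    else note = 'error: unexpected length';
  end;

  /* extra check, not in the pseudocode: a month outside 1-12 means a bad split */
  if not missing(month) and not (1 <= month <= 12) then
    note = 'error: month out of range';
  datalines;
201310
20131
1310
137
2013
;
run;

proc print data=split_word3;
run;
```

Rows that end up with a non-blank note are the ones the rules cannot resolve. A two-digit year is left as is here; if it is later passed to mdy, SAS expands it according to the YEARCUTOFF system option.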
But I'm highly doubtful. This looks like a huge unstructured data problem.
Tom
You are correct; this is a huge unstructured data problem. I've been racking my brain trying to figure out if there's any code that would force the values into uniformity, but I'm starting to think that it may not be possible. I'll try out your code and report back; thanks for your response!
Can you do a mass find and replace in the Excel file for all CTRL+ENTER values and change them to an asterisk or some other symbol that can be used to parse the data later on?
If the Excel file had a CTRL+ENTER in the file and you convert it to CSV and read it in, I believe the return is kept.
That would help parse the data and avoid other issues...possibly.
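If the line break does survive into the SAS character value (an assumption that depends on how the file gets imported), something like the following could check for it and swap it for an asterisk on the SAS side. The variable names and the sample value are made up for illustration.

```sas
/* Sketch only: RAW_DATE, HAS_BREAK, MARKED and the sample value are hypothetical. */
/* Assumes the embedded line feed ('0A'x) is still present after import.           */
data demo;
  length raw_date $40;
  format second_date mmddyy10.;

  /* simulate a cell holding two dates separated by an in-cell line break */
  raw_date = cats('9/26/2013', '0A'x, '10/7/13');

  /* 1 if the value contains a line feed or carriage return, else 0 */
  has_break = (indexc(raw_date, '0A0D'x) > 0);

  /* replace line feeds and carriage returns with asterisks */
  marked = translate(raw_date, '**', '0A0D'x);

  /* the second date can then be pulled out with the asterisk as delimiter */
  second_date = input(scan(marked, 2, '*'), ?? mmddyy10.);

  put has_break= marked= second_date=;
run;
```

If has_break is 0 for every row in the real data, the break was dropped somewhere before SAS saw it, and the replacement would have to happen in Excel instead.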
Unfortunately when I convert the file to a CSV format, it screws with a lot of the other data, so I haven't been able to do that successfully. Otherwise, that would be a great call!