Howdy folks!
This has been puzzling me for quite some time now... I'm working with some very dirty data that came to me in the form of 41 Excel spreadsheets. One of the issues with this data is that some of the date values look like this:
That 9/26/201310/7/13 looked like this in the original data file:
So what I'm trying to figure out is - is there a way to remove everything in a variable value EXCEPT for the most recent date? That way all my single date values will remain the same, but all my multiple date values will only keep the most current and valid date.
Thanks so much for your time! I've tried searching for solutions to this but I think I'm just not using the right keywords.
Can you please type the values as text rather than pasting an image? I'd rather not type them into my SAS editor by hand to write the program.
@fwashburn please post the data as text, not an image. We'd have to type out your data to work with it, but it's easier if you do that instead of us 🙂
First, I'd do a countw with the slash as separator. If it's five, separate the first 4 characters (substr) from the third word, and the next part from position 5, so you can then convert the parts to dates (input to convert to numeric and mdy to build the dates).
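To make the steps above concrete, here is a minimal sketch of the logic, written in Python rather than SAS purely for illustration (in SAS the same steps would use countw, scan, substr, input and mdy). The function names are made up for this example, and it assumes the first date always has a four-digit year and that any two-digit year belongs to the 2000s.

```python
from datetime import date


def make_date(month: str, day: str, year: str) -> date:
    """Build a date from text parts; two-digit years are ASSUMED to be 20xx."""
    y = int(year)
    if y < 100:
        y += 2000  # assumption, not something the data guarantees
    return date(y, int(month), int(day))


def split_fused_dates(value: str) -> list[date]:
    """Split a value such as '9/26/201310/7/13' into its component dates."""
    words = value.strip().split("/")  # same idea as countw/scan with '/'
    if len(words) == 3:
        # Ordinary single date: m/d/y
        return [make_date(*words)]
    if len(words) == 5:
        # Two dates fused together: the third word is year + next month
        m1, d1, fused, d2, y2 = words
        y1, m2 = fused[:4], fused[4:]  # first 4 characters, then the rest
        return [make_date(m1, d1, y1), make_date(m2, d2, y2)]
    raise ValueError(f"unexpected number of slash-separated words: {value!r}")


if __name__ == "__main__":
    dates = split_fused_dates("9/26/201310/7/13")
    print(max(dates))  # the most recent of the parsed dates
```

Taking max() of the returned list keeps only the most recent date, and single-date values pass through unchanged.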
Is there absolute consistency? You show two variants:
1. Three words separated by slashes. When this happens, the field is always m/d/y, with each word a varying number of digits. This one is easy.
2. Five words separated by slashes. In this case, the field appears to be m/d/x/d/y, where x is a year and a month run together, with each word a varying number of digits. Words 1, 2, 4, and 5 are easy. Word 3 is easy, AS LONG AS THE YEAR IS ALWAYS FOUR DIGITS! You have cases in the other fields of a two-digit year. This will cause you huge problems.
All I can think of is this (pseudocode):
IF length of field 3 is less than 3
THEN *error* /* can't make a year and a month out of two or fewer digits */
ELSE IF first two digits of field 3 are '20' /* has to be a four-digit year, unless your years can include 2020 */
THEN DO
    IF length of field 3 is 5
    THEN take first four digits as year, fifth as month
    ELSE IF length of field 3 is 6
    THEN take first four digits as year, fifth and sixth as month
    ELSE *error* /* can't make any sense out of a 3-4 or 7 or above character field starting with '20' */
END
ELSE DO /* first two digits not '20', must be a two-digit year */
    IF length of field 3 is 3
    THEN take first two digits as year, third as month
    ELSE IF length of field 3 is 4
    THEN take first two digits as year, third and fourth as month
    ELSE *error* /* can't make any sense out of a 5 or above character field not starting with '20' */
END
But I'm highly doubtful. This looks like a huge unstructured data problem.
Tom
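For anyone who wants to try the branching on field 3 described in the pseudocode, here is a rough Python sketch of it (not SAS, just to show the logic). The function name is hypothetical, two-digit years are assumed to mean 20xx, and the month range check at the end is an extra safeguard that is not part of the pseudocode.

```python
def split_year_month(field3: str) -> tuple[int, int]:
    """Split the fused third word (year + month) into (year, month)."""
    n = len(field3)
    if not field3.isdigit() or n < 3:
        # Can't make a year and a month out of two or fewer digits
        raise ValueError(f"cannot split {field3!r}")

    if field3.startswith("20"):
        # Treated as a four-digit year followed by a 1- or 2-digit month
        if n not in (5, 6):
            raise ValueError(f"cannot split {field3!r}")
        year, month = int(field3[:4]), int(field3[4:])
    else:
        # Treated as a two-digit year followed by a 1- or 2-digit month
        if n not in (3, 4):
            raise ValueError(f"cannot split {field3!r}")
        year = 2000 + int(field3[:2])  # ASSUMPTION: two-digit years are 20xx
        month = int(field3[2:])

    # Extra sanity check, not in the pseudocode
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {field3!r}")
    return year, month


if __name__ == "__main__":
    for sample in ("201310", "20135", "1310", "137"):
        print(sample, split_year_month(sample))
```

Values that hit one of the error branches are probably best written to a separate list for manual review rather than silently dropped.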
You are correct; this is a huge unstructured data problem. I've been wracking my brain trying to figure out if there's any code that would force the values into uniformity but thinking that it may not be possible. I'll try out your code and report back; thanks for your response!
Can you do a mass find and replace in the Excel file for all CTRL+ENTER values and change them to an asterisk or some other symbol that can be used to parse the data later on?
If the Excel file had a CTRL+ENTER in the file and you convert it to CSV and read it in, I believe the return is kept.
That would help parse the data and avoid other issues...possibly.
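If the line breaks can be preserved or swapped for a marker, the parsing becomes much simpler. A small Python sketch of that idea follows, assuming an asterisk as the marker and m/d/y dates with either two- or four-digit years; the function names are invented for the example.

```python
from datetime import date, datetime


def parse_mdy(text: str) -> date:
    """Parse m/d/yyyy or m/d/yy; raises ValueError if neither layout fits."""
    text = text.strip()
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):  # try four-digit year first
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"not a recognisable m/d/y date: {text!r}")


def latest_date(cell: str, marker: str = "*") -> date:
    """Return the most recent date in a cell holding one or more dates."""
    # Treat embedded line breaks the same way as the replacement marker
    normalised = cell.replace("\r\n", marker).replace("\n", marker)
    dates = [parse_mdy(part) for part in normalised.split(marker) if part.strip()]
    if not dates:
        raise ValueError("no dates found in cell")
    return max(dates)


if __name__ == "__main__":
    print(latest_date("9/26/2013*10/7/13"))
    print(latest_date("9/26/2013\n10/7/13"))
```

Note that Python's %y maps two-digit years 00-68 to 2000-2068, so older dates stored with two-digit years would need their own handling.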
Unfortunately when I convert the file to a CSV format, it screws with a lot of the other data, so I haven't been able to do that successfully. Otherwise, that would be a great call!