fwashburn
Fluorite | Level 6

Howdy folks!

 

This has been puzzling me for quite some time now... I'm working with some very dirty data that came to me in the form of 41 Excel spreadsheets. One of the issues with this data is that some of the date values look like this:

 

dt_inspect_hfh.PNG

 

That 9/26/201310/7/13 looked like this in the original data file:

 

Capture2.PNG

 

So what I'm trying to figure out is this: is there a way to remove everything in a variable value EXCEPT for the most recent date? That way all my single-date values will remain the same, but all my multiple-date values will keep only the most recent valid date.

 

Thanks so much for your time! I've tried searching for solutions to this but I think I'm just not using the right keywords.

novinosrin
Tourmaline | Level 20

Can you please type the values as text rather than pasting an image? I'm too lazy to type them into my SAS editor to write the program.

Reeza
Super User

@fwashburn please post the data as text, not an image. We'd have to type out your data to work with it, but it's easier if you do that instead of us 🙂

Kurt_Bremser
Super User

First, I'd do a countw with the slash as separator. If it's five, separate the first 4 characters (substr) from the third word, and the next part from position 5, so you can then convert the parts to dates (input to convert to numeric and mdy to build the dates).
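A rough sketch of that idea, written in Python only so the logic is easy to test; it should port to a DATA step with countw, scan, substr, input and mdy. The function names are hypothetical, and it assumes (my assumption, not confirmed by the data) that the first date in a fused value always has a four-digit year and that two-digit years belong to the 2000s.

```python
from datetime import date


def expand_year(year: int) -> int:
    """Assumption: any two-digit year belongs to the 2000s."""
    return year + 2000 if year < 100 else year


def latest_date(raw: str) -> date:
    """Return the most recent date found in a m/d/y or fused m/d/yyyym/d/y value."""
    words = raw.strip().split("/")  # same role as countw/scan with '/' as delimiter
    if len(words) == 3:
        month, day, year = (int(w) for w in words)
        return date(expand_year(year), month, day)
    if len(words) == 5:
        m1, d1, fused, d2, y2 = words
        # Third word holds the first year (4 characters) fused with the second month.
        y1, m2 = fused[:4], fused[4:]
        first = date(int(y1), int(m1), int(d1))
        second = date(expand_year(int(y2)), int(m2), int(d2))
        return max(first, second)
    raise ValueError(f"unexpected number of slash-separated words in {raw!r}")


print(latest_date("9/26/201310/7/13"))
```

Values that do not split into three or five words raise an error instead of being guessed at, so they can be reviewed by hand.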

TomKari
Onyx | Level 15

Is there absolute consistency? You show two variants:

 

1. Three words separated by slashes. When this happens, the field is always m/d/y, with each word a varying number of digits. This one is easy.

 

2. Five words separated by slashes. In this case, the field appears to be m/d/x/d/y, with each word a varying number of digits, where x is the first date's year fused with the second date's month. Words 1, 2, 4, and 5 are easy. Word 3 is easy, AS LONG AS THE YEAR IS ALWAYS FOUR DIGITS! You have cases in the other fields of a two-digit year. This will cause you huge problems.

All I can think of is this (pseudocode):
IF length of field 3 is less than 3 THEN *error* /* can't make a year and a month out of two or fewer digits */
ELSE IF first two digits of field 3 are '20' /* has to be a four-digit year, unless your years can include 2020 */
     THEN DO
        IF length of field 3 is 5
        THEN take first four digits as year, fifth as month
        ELSE IF length of field 3 is 6
        THEN take first four digits as year, fifth and sixth as month
        ELSE *error* /* can't make any sense out of a 3-4 or 7 or above character field starting with '20' */
     END
ELSE DO /* first two digits not '20', must be a two-digit year */
        IF length of field 3 is 3
        THEN take first two digits as year, third as month
        ELSE IF length of field 3 is 4
        THEN take first two digits as year, third and fourth as month
        ELSE *error* /* can't make any sense out of a 5 or above character field not starting with '20' */
     END
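For anyone who wants to test the branching before writing the DATA step, here is a minimal Python sketch of the pseudocode above. The function name split_year_month is hypothetical, and the year is returned exactly as written (two or four digits); picking a century for two-digit years is left to the caller.

```python
from typing import Tuple


def split_year_month(field3: str) -> Tuple[int, int]:
    """Split the fused third word into (year as written, month)."""
    n = len(field3)
    if not field3.isdigit() or n < 3:
        # Two or fewer digits cannot hold both a year and a month.
        raise ValueError(f"cannot split {field3!r}")
    if field3.startswith("20"):
        # Treated as a four-digit year; this misreads a two-digit year of 20.
        if n in (5, 6):
            return int(field3[:4]), int(field3[4:])
        raise ValueError(f"cannot split {field3!r}")
    # Does not start with '20', so it must be a two-digit year.
    if n in (3, 4):
        return int(field3[:2]), int(field3[2:])
    raise ValueError(f"cannot split {field3!r}")


for value in ("201310", "20139", "1310", "139"):
    print(value, split_year_month(value))
```

A value such as 2012 still lands in the error branch even though it could mean year 20, month 12, which is the 2020 caveat noted in the pseudocode comment.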

 

But I'm highly doubtful. This looks like a huge unstructured data problem.

 

Tom

fwashburn
Fluorite | Level 6

You are correct; this is a huge unstructured data problem. I've been racking my brain trying to figure out if there's any code that would force the values into uniformity, but I'm starting to think that it may not be possible. I'll try out your code and report back; thanks for your response!

Reeza
Super User

Can you do a mass find and replace in the Excel file for all CTRL+ENTER values and change them to an asterisk or some other symbol that can be used to parse the data later on?

Reeza
Super User

If the Excel file had a CTRL+ENTER in the file and you convert it to CSV and read it in, I believe the return is kept.

That would help parse the data and avoid other issues...possibly.
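If the in-cell line break (or a replacement symbol such as the asterisk) does survive the import, the parsing becomes much simpler: split on the separator, convert each piece, and keep the maximum. A hedged Python sketch follows; the function name and the choice of separators are assumptions for illustration, and the same steps could be done in SAS with scan and input.

```python
from datetime import date, datetime


def latest_from_delimited(cell: str, separators: str = "*") -> date:
    """Return the most recent date in a cell whose dates are separated by
    whitespace (including line breaks) or by any character in `separators`."""
    for sep in separators:
        cell = cell.replace(sep, " ")
    dates = []
    for part in cell.split():  # split() also breaks on \n and \r\n
        for fmt in ("%m/%d/%Y", "%m/%d/%y"):  # four-digit year first, then two-digit
            try:
                dates.append(datetime.strptime(part, fmt).date())
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"unrecognized date {part!r}")
    if not dates:
        raise ValueError("no dates found in cell")
    return max(dates)


print(latest_from_delimited("9/26/2013\n10/7/13"))
```

Note that Python's two-digit year handling maps 69 through 99 to the 1900s, which may or may not suit inspection dates like these.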

fwashburn
Fluorite | Level 6

Unfortunately when I convert the file to a CSV format, it screws with a lot of the other data, so I haven't been able to do that successfully. Otherwise, that would be a great call!
