
How to remove extraneous dates from variables with multiple dates listed

Occasional Contributor
Posts: 8


Howdy folks!

 

This has been puzzling me for quite some time now... I'm working with some very dirty data that came to me in the form of 41 Excel spreadsheets. One of the issues with this data is that some of the date values look like this:

 

dt_inspect_hfh.PNG

 

That 9/26/201310/7/13 looked like this in the original data file:

 

Capture2.PNG

 

So what I'm trying to figure out is - is there a way to remove everything in a variable value EXCEPT for the most recent date? That way all my single date values will remain the same, but all my multiple date values will only keep the most current and valid date.

 

Thanks so much for your time! I've tried searching for solutions to this but I think I'm just not using the right keywords.

PROC Star
Posts: 1,282

Re: How to remove extraneous dates from variables with multiple dates listed

Posted in reply to fwashburn

Can you please post the values as text rather than as a picture? Otherwise I have to retype them into my SAS editor before I can write the program.

Super User
Posts: 22,818

Re: How to remove extraneous dates from variables with multiple dates listed

Posted in reply to fwashburn

@fwashburn please post the data as text, not an image. Otherwise we have to type out your data to work with it, and it's easier if you do that instead of us :)

Super User
Posts: 9,548

Re: How to remove extraneous dates from variables with multiple dates listed

Posted in reply to fwashburn

First, I'd do a countw with the slash as separator. If it's five, separate the first 4 characters (substr) from the third word, and the next part from position 5, so you can then convert the parts to dates (input to convert to numeric and mdy to build the dates).
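
A rough sketch of that idea, assuming the first date always carries a four-digit year and that a two-digit year on the second date means 20xx. The dataset and variable names (HAVE, WANT, RAW, D1, D2, LATEST) are made up for illustration and are not from the original data:

```sas
/* Sketch only: HAVE, WANT, RAW, D1, D2 and LATEST are hypothetical names. */
data have;
   input raw :$40.;
   datalines;
9/26/201310/7/13
3/5/2014
;
run;

data want;
   set have;
   format d1 d2 latest mmddyy10.;
   if countw(raw, '/') = 5 then do;
      /* word 3 holds the first year fused with the second month, e.g. 201310 */
      w3 = scan(raw, 3, '/');
      /* first date: words 1 and 2 plus the first 4 characters of word 3 */
      d1 = mdy(input(scan(raw, 1, '/'), 2.),
               input(scan(raw, 2, '/'), 2.),
               input(substr(w3, 1, 4), 4.));
      /* second date: rest of word 3 (from position 5), then words 4 and 5 */
      y2 = input(scan(raw, 5, '/'), 4.);
      if y2 < 100 then y2 = y2 + 2000;   /* assumption: two-digit years are 20xx */
      d2 = mdy(input(substr(w3, 5), 2.),
               input(scan(raw, 4, '/'), 2.),
               y2);
      latest = max(d1, d2);              /* keep the most recent date */
   end;
   /* single dates are read as they are */
   else if countw(raw, '/') = 3 then latest = input(raw, mmddyy10.);
   drop w3 y2;
run;
```

Values that fit neither pattern leave LATEST missing, so they can be listed and checked by hand.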

PROC Star
Posts: 1,257

Re: How to remove extraneous dates from variables with multiple dates listed

Posted in reply to fwashburn

Is there absolute consistency? You show two variants:

 

1. Three words separated by slashes. When this happens, the field is always m/d/y, with each word a varying number of digits. This one is easy.

 

2. Five words separated by slashes. In this case, the field appears to be m/d/x/d/y, where x is the year of the first date run together with the month of the second, and each word has a varying number of digits. Words 1, 2, 4, and 5 are easy. Word 3 is easy, AS LONG AS THE YEAR IS ALWAYS FOUR DIGITS! You have cases in the other fields of a two-digit year. This will cause you huge problems.

All I can think of is this (pseudo code)
IF length of field 3 is less than 3 THEN *error* /* can't make a year and a month out of two or fewer digits. */
ELSE IF first two digits of field 3 are '20' /* has to be a four-digit year, unless your years can include 2020 */
     THEN DO
        IF length of field 3 is 5
        THEN take first four digits as year, fifth as month
        ELSE IF length of field 3 is 6
        THEN take first four digits as year, fifth and sixth as month
        ELSE *error* /* can't make any sense out of a 3-4 or 7 or above character field starting with '20' */
     END
ELSE DO /* first two digits not '20', must be a two-digit year */
        IF length of field 3 is 3
        THEN take first two digits as year, third as month
        ELSE IF length of field 3 is 4
        THEN take first two digits as year, third and fourth as month
        ELSE *error* /* can't make any sense out of a 5 or above character field not starting with '20' */
     END
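
In actual SAS, the pseudo code above might look roughly like the step below. This is only a sketch: the names (HAVE, SPLIT, RAW, W3, YR1, MO2) are invented for illustration, and it assumes any two-digit year belongs to the 2000s.

```sas
/* Sketch only: HAVE, SPLIT, RAW, W3, YR1 and MO2 are hypothetical names. */
data have;
   input raw :$40.;
   datalines;
9/26/201310/7/13
9/26/1310/7/13
9/26/2013/7/13
;
run;

data split;
   set have;
   length w3 $ 12;
   if countw(raw, '/') = 5 then do;
      w3  = scan(raw, 3, '/');        /* fused year + month */
      len = length(w3);
      if len < 3 or notdigit(strip(w3)) > 0 then
         put 'ERROR: cannot split field 3: ' raw=;
      else if substr(w3, 1, 2) = '20' and len in (5, 6) then do;
         yr1 = input(substr(w3, 1, 4), 4.);        /* four-digit year */
         mo2 = input(substr(w3, 5), 2.);           /* one- or two-digit month */
      end;
      else if substr(w3, 1, 2) ne '20' and len in (3, 4) then do;
         yr1 = 2000 + input(substr(w3, 1, 2), 2.); /* assumption: year is 20xx */
         mo2 = input(substr(w3, 3), 2.);
      end;
      /* everything else is ambiguous, as in the pseudo code */
      else put 'ERROR: ambiguous field 3: ' raw=;
   end;
   drop len;
run;
```

The third test row (field 3 is 2013, four characters starting with '20') is written to the log as ambiguous rather than guessed at, which is the behaviour the pseudo code asks for.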

 

But I'm highly doubtful. This looks like a huge unstructured data problem.

 

Tom

Occasional Contributor
Posts: 8

Re: How to remove extraneous dates from variables with multiple dates listed

You are correct; this is a huge unstructured data problem. I've been racking my brain trying to figure out if there's any code that would force the values into uniformity, but I'm starting to think that it may not be possible. I'll try out your code and report back; thanks for your response!

Super User
Posts: 22,818

Re: How to remove extraneous dates from variables with multiple dates listed

Posted in reply to fwashburn

Can you do a mass find and replace in the Excel file for all CTRL+ENTER values and change them to an asterisk or some other symbol that can be used to parse the data later on?

Super User
Posts: 22,818

Re: How to remove extraneous dates from variables with multiple dates listed

Posted in reply to fwashburn

If the Excel file had a CTRL+ENTER in the file and you convert it to CSV and read it in, I believe the return is kept.

That would help parse the data and avoid other issues...possibly.
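
If a separator does survive the import (a line feed, a carriage return, or a substituted asterisk), the parsing becomes fairly simple. Here is a minimal sketch under that assumption; HAVE, WANT, RAW, D and LATEST are made-up names, and the test data is built by hand rather than read from the real spreadsheets:

```sas
/* Sketch only: HAVE, WANT, RAW, D and LATEST are hypothetical names. */
data have;
   length raw $ 40;
   raw = cats('9/26/2013', '0A'x, '10/7/13'); output;  /* two dates split by a line feed */
   raw = '3/5/2014*4/1/2014';                 output;  /* two dates split by an asterisk */
   raw = '3/5/2014';                          output;  /* single date */
run;

data want;
   set have;
   format latest mmddyy10.;
   /* treat line feed, carriage return and asterisk as the only separators */
   do i = 1 to countw(raw, '0A0D'x || '*');
      /* ?? suppresses log notes for pieces that are not valid dates */
      d = input(scan(raw, i, '0A0D'x || '*'), ?? mmddyy10.);
      latest = max(latest, d);   /* MAX ignores missing values */
   end;
   drop i d;
run;
```

Two-digit years are resolved by the session's YEARCUTOFF= option, so it is worth checking that setting before trusting the result.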

Occasional Contributor
Posts: 8

Re: How to remove extraneous dates from variables with multiple dates listed

Unfortunately when I convert the file to a CSV format, it screws with a lot of the other data, so I haven't been able to do that successfully. Otherwise, that would be a great call!
