BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anotherdream
Quartz | Level 8

Hello everyone. I Hvae a file that needs to be processed on a regular basis (daily). The issue is the file has specific date columns, of which the columns themselves do not come in the same order.

All of the dates within one column are the same, but from one file to another the dates can come (and do come) in any of the following formats.

MM/DD/YYYY

DD/MM/YYYY

YYYYDDMM

MMDDYYYY

etc.

Does anyone know of a way to dynamically determine a date informat from a file when it comes like this? I can't use proc import because it assigns values of 20120105 as numeric fields (cause techincally they are).  Is there perhaps a way to do a proc import and suggest to the system that the field has to be a valid informat from a date range?

Any and all suggestions are appreciated!

Brandon

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

No way ,unless you make a new variable to explain the current obs use what kind of format .


data have;
 input date : $20. fmt $20.;
 want_date=inputn(date,fmt);
 format want_date date9.;
cards;
01/21/2012 mmddyy10.
02/04/2014 ddmmyy10.
;
run;

Xia Keshan

View solution in original post

12 REPLIES 12
art297
Opal | Level 21

Brandon,

More info is needed.  If the data are coming in the form of Excel files, one can use dde to identify the formats that were actually used for each data point.

The anydtdte informat is quite good at capturing most dates, but not infallible.

Anotherdream
Quartz | Level 8

Hello Arthur.

So the files are indeed coming in Excel format in most cases, however in some cases they come in CSV formats. However getting them to come 100% in Excel could be done.

I am actually not sure how to use dde to identify the formats, however even if I could would that help? THe dates in the format of YYYYMMDD aren't actually formatted as dates, but are just numbers on a spreadsheet that represent dates with general formatting applied.

Not sure if that helped answer your question.

Brandon

art297
Opal | Level 21

One thing to try would be importing the excel files by using a libname statement, and then including the dbsastype option in the data step that you use to actually import the file.  Here is some pseudo code that assumes you have a workbook in c:\test called myexcel.xlsx and a date field called thedate:

libname excelxls Excel "c:\test\myexcel.xlsx;

data want;

set excelxls.'Sheet1$'n (DBSASTYPE=(thedate='date'));

run;

A paper that discusses the topic and provides some nice examples can be found at:

http://analytics.ncsu.edu/sesug/2011/CC13.Benjamin.pdf

Patrick
Opal | Level 21

Having "MM/DD/YYYY" and "DD/MM/YYYY" is an issue as for example "01102013" is possible for both patterns.


For Excel sources taking the approach Art suggests would circumvent this issue but you would need to code some extra logic for cvs sources. I believe for csv's the safest way would be to add an extra pass through the source analyzing the date strings to determine if it's MM/DD or DD/MM.

ballardw
Super User

The same extra pass could also look to id the YYYYMMDD and MMDDYYYY though this should be a simpler task.

stevejgoodman
Calcite | Level 5

Have you looked at the anydate informat? Works for me. You might need to use the datestyle option to help the compiler disambiguate month and day.

Alternatively take a look at this

http://support.sas.com/resources/papers/proceedings11/117-2011.pdf

Anotherdream
Quartz | Level 8

Hello. Sorry for the late response, loots of work issues came up.

Anyways, Patrick could you elaborate a bit on what you mean by adding an extra pass through the source analyzing the date strings? Do you basically mean to read in the file as character values, and then try each date format and find which one that works, and then save that information in a dataset and overwrite the default informat statements?

If so it looks like we'd have to read the file in twice, but if that is the price we have to pay for this to work  then that's acceptable, I just want to make sure we're on the same page!

Brandon

Patrick
Opal | Level 21

Yes, my thinking was that you need a first pass through the data where you determine what informat would work and then a second pass where you read the data with this informat.

When I've posted my answer I didn't spend a lot of thinking of how to actually implement this and I came now to realise that such an approach is basically re-building what "guessingrows" does in Proc Import.

If your data structure allows then I would suggest you're using Proc Import as coding a similar "guessingrows macro" which is as broad and robust could be quite a bit of work.

After running Proc Import you probably would need an additional mapping step so that the resulting data set variables have always the same attributes (length, format).

data _null_;
  file "C:/temp/test.csv";
  put "My_Date";
  do mydate='01jan2014'd to '31dec2014'd by 20;
    put mydate yymmdd10.;
/*    put mydate ddmmyy10.;*/
/*    put mydate mmddyy10.;*/
  end;
run;

proc import datafile="C:/temp/test.csv"
  out=test
  dbms=csv
  replace;
  delimiter=',';
  datarow=2;
  guessingrows=max;
  getnames=yes;
run;

proc datasets lib=work nolist;
  modify test;
    attrib My_Date format=date9.;
  run;
quit;

RichardinOz
Quartz | Level 8

Brandon

If this data is all internal to your organisation, and if the result of your work is important to the organisation then I think the way to solve this problem is to get agreement on specifications for formats of your inputs, and reject files that do not comply.  Your argument should be that since the data is important it should not be left to guessing game whether dates are in one format or another. An organisation that does not see the sense of this position is at risk.

More likely you are receiving data from external sources or incompatible systems, in which case you need to research what formats are being used in each case, and adapt your code to each source rather than taking a 'one size fits all' approach.  If you truly have sources which will randomly apply both DDMMYY and MMDDYY formats then the data simply cannot be relied on for any significant purpose.

Having said that you should note that the date formats in Excel workbooks are irrelevant:  Excel stores the date value as a count of days and fractions of a day which SAS is able to import accurately.  SAS will only use the format to identify that the value is a date, and to choose whether to convert to dates or datetime values in SAS.  Caveat: if dates are stored as character values in Excel (maybe prefaced by a hidden single quote, or with a trailing space) then you are stuck with having to make a conversion post import into SAS. 

CSV files are not an improvement because in them all values, including dates, are stored as character expressions.  Back to paras 1 & 2 above.

Richard in NZ

Anotherdream
Quartz | Level 8

Hello everyone. Sorry for the long delay.  This question was basically tabled due to other priorities, but figured now would be a good time to open it back up.

Richard:  My purpose in trying to dynamically determine a date informat is basically to save people who are new  sas programmers from trying to figure out what Date informat to use, because there are many of them and they can be confusing to people.

This also would result in more code re-usability.  If we could always read a csv file with all variables being Varchar, and then determine which variables are dates, this code could dynamically correctly read in dates.  This would drop A LOT of coding time, as having to know the correct date informat for every variable is a HUGE time-sink for new SAS users. Espeically those coming from sql where date conversion is done inherently for you (in sql '2014-01-31 and 01/31/2014 and Jan 1 2014' are all the same thing and are converted by default for you... This is basically what i'm trying to replicate).

   I agree with what you're saying and I think this is the path I'm going to ahve to go down.  I will have to look at a dataset, and keep all the variables that are 'dates' (in string form) and then try to determine how the strings are set up.

My next logical question would be: what is the best way to do this?

I'm thinking I would loop over each variable and then for any one variable, keep the first observation that is Non Null.  Then attempt several date informat types using an input statement.  ANy inputs that work and give correct numeric values will be the one used..    (Note I realize this is a problem with european dates of 01/05/2014 vs 05/01/2014).  

The problem I have next is what do we do about string dates that are not in any kind of reconizable sas format. Dates such as "jan 1 2014:04:03:023.203" or "2014-01-03 1:25 PM"   or "2014-01-03 12:32:35.025 AM".  I imagine we would have to create out own "informats" (strings of code that would read in these values) and test if they are converted correctly, and then if so stop at that point?

This one is a fun brain teaser.

Thanks all for your continued help.

Patrick
Opal | Level 21

For production data one has normally an interface contract with data dictionary so there shouldn't be any guesswork. I certainly wouldn't implement a "guessing step" adding an additional pass through the data for such situations.

For ad-hoc situations: Using the SAS EG import wizard does the job already for you. If required you then can also take the generated data step and amend it.

There are also the the any... informats for dates, times and datetimes which deal with multiple string patterns. If this is not enough then you can always create your own informats - eventually by nesting the new informat with an already existing one.

What you also can do is use PROC FCMP to build your own function which then tries a whole bunch of informats to select one which returns a non-missing value. You then can create your own informat where you call this function.

Having all that said: I believe it's a good thing that programmers must have a certain understanding of the data they are dealing with. I don't think it's that hard to get up-to-speed with formats and informats and I wouldn't leave everything to a fully automated process. The risk is that people then don't use their brains anymore and just trust the macro/function/format. That will go wrong sooner or later.

Ksharp
Super User

No way ,unless you make a new variable to explain the current obs use what kind of format .


data have;
 input date : $20. fmt $20.;
 want_date=inputn(date,fmt);
 format want_date date9.;
cards;
01/21/2012 mmddyy10.
02/04/2014 ddmmyy10.
;
run;

Xia Keshan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 5604 views
  • 3 likes
  • 7 in conversation