Solved: Re: SUBSTR to change dates as text to actual dates

Yves_Boonen · Posted 04-14-2016 07:32 AM

I have a CSV-file containing dossier numbers, dossier titles, dossier creators and dates (deadline, start date, close date, creation date). When a dossier is created, it will contain 5 sub dossiers. The first dossier will become active and the other four will become waiting. Cancelled and waiting dossiers do not have any dates, except for a creation date. Instead of showing dates, these cells will contain the value "nulldate". To make the import (date / infile / length / format / informat / input / keep) easier, I decided to import every column as a textstring. The idea was to change the format of the dates once required, using the SUBSTR command. (Date format is dd/mm/yyyy hh:mm.)

Example:

deadline	start_date	close_date	r_creation_date
11/04/2016 14:53	29/02/2016 14:53	nulldate	29/02/2016 14:53
11/04/2016 15:04	29/02/2016 15:04	30/03/2016 14:57	29/02/2016 15:04
nulldate	nulldate	nulldate	1/03/2016 11:04
8/03/2016 11:24	1/03/2016 11:24	7/03/2016 15:52	1/03/2016 11:24
nulldate	nulldate	nulldate	1/03/2016 11:24
25/04/2016 12:01	4/04/2016 12:01	nulldate	1/03/2016 11:24
15/03/2016 11:36	1/03/2016 11:35	1/03/2016 12:57	1/03/2016 11:35
15/03/2016 10:41	1/03/2016 10:41	1/03/2016 10:41	1/03/2016 10:40
19/04/2016 12:27	8/03/2016 12:27	nulldate	1/03/2016 10:55
8/03/2016 10:58	1/03/2016 10:58	7/03/2016 14:15	1/03/2016 10:58
nulldate	nulldate	nulldate	1/03/2016 11:11
nulldate	nulldate	nulldate	1/03/2016 11:14
15/04/2016 11:46	1/04/2016 11:46	nulldate	1/03/2016 11:15
8/03/2016 11:15	1/03/2016 11:15	1/04/2016 11:44	1/03/2016 11:15
nulldate	nulldate	nulldate	1/03/2016 10:36
nulldate	nulldate	nulldate	1/03/2016 11:35
nulldate	nulldate	nulldate	1/03/2016 10:41
28/03/2016 9:56	14/03/2016 9:56	nulldate	1/03/2016 11:18
21/03/2016 14:21	29/02/2016 14:21	16/03/2016 10:52	29/02/2016 14:21
21/03/2016 14:34	29/02/2016 14:34	17/03/2016 12:04	29/02/2016 14:34

How would I go about using SUBSTR to change the format from $char16. to the actual format used through a datastep in my query (not the query I used for importing, but the query I used to analyse my data).

Additionally, later on I will need to compare some of these dates to a fixed date. This is because I am working with exports from a database and so the fixed date becomes the date of the export day. (%LET date=1APR2016;)

It doesn't matter which format it ends up using, as long as I can make the analysis. Any suggestions how I can start this?

Kurt_Bremser · Posted 04-14-2016 09:53 AM

Since r_creation_date is a timestamp, don't include it in the length statement at all. Timestamps are best left at the default length of 8 (numeric).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Yves_Boonen · Posted 04-14-2016 07:44 AM

To clarify:

I have to check if the dossiers were handled before the deadline. For dossiers that are completed, a close date is available. The calculation will become like this:

  when ('DONE') do;
    	done+1;
    	if close_date > deadline then done_overtime+1;
  end;

However, for dossiers that are still active, there will be no close date available. In that case I will need to compare it to another date instead. Since the export I am currently using, was created on the 1st of April, this fixed date has to be April 1st, 2016. The calculation will then become like this:

when ('ACTIVE') do;
    	active+1;
    	if date() > deadline then active_overtime+1;
end;

Naturally this calculation won't work, since SAS thinks the dates are a textstring and not dates. Importing the dates as textstrings is a necessity because not only does it actually contain a textstring ("nulldate"), I need to import the data from the CSV-file untouched. A colelague pointed out that the best approach would be to use the import query for importing everything as a textstring and then later on use the SUBSTR command to change the format in the datastep from your actual analysis.

Kurt_Bremser · Posted 04-14-2016 07:53 AM

Hands down, the best method to deal with these problems is to store dates and timestamps as intended, and use formats for display or output as necessary; all comparisons and mathematical operations become a breeze instead of a massive PITA.

So the way to go in your situation is to

- read the column as character

- check if it contains "nulldate"

- if yes, set (numerical) date/timestamp variable to missing (.)

- else convert with proper format

- drop temporary character variable

If you want to preserve the original unconverted data, add a sufficiently large character variable in the data set and fill it with the contents of _infile_.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Yves_Boonen · Posted 04-14-2016 08:02 AM

What you are suggesting makes sense. If I understand correctly, I would have to do this during the import, don't I? Just in case I will show you the import query I used.

data unload.task;
infile '\\networklocation\unloads\JNX\JNX_unload_tasks_ 201603_.csv' dlm=';' truncover firstobs=2 lrecl=500;

length
  r_object_id $16
  object_name $70
  deadline $16
  start_date $16
  close_date $16
  task_handler $25
  task_state $9
  dossier_nr $6
  sequence $1
  r_creation_date $16
;

format
  r_object_id $char16.
  object_name $char70.
  deadline $char16.
  start_date $char16.
  close_date $char16.
  task_handler $char25.
  task_state $char9.
  dossier_nr $char6.
  sequence $char1.
  r_creation_date $char16.
;

informat
  r_object_id $char16.
  object_name $char70.
  deadline $char16.
  start_date $char16.
  close_date $char16.
  task_handler $char25.
  task_state $char9.
  dossier_nr $char6.
  sequence $char1.
  r_creation_date $char16.
;

input
  r_object_id
  object_name
  deadline
  start_date
  close_date
  task_handler
  task_state
  dossier_nr
  sequence
  r_creation_date
;

keep
  r_object_id
  object_name 
  deadline
  start_date
  close_date
  task_handler
  task_state
  dossier_nr
  sequence
  r_creation_date
;
run;

This brings us back to the format of the dates. The format itself is should be date10., but that would be without the "hh:mm". I thought the SUBSTR could manage that by saying something like: "new_deadline=SUBSTR(deadline,1,10)", but I can't get that to work at all.

RW9 · Posted 04-14-2016 08:02 AM

You would use inputs and puts to arrange your data after the fact - not sure why you don't just update this on the read in? Here is an example of creating date, time, and datetime variables:

data have;
  length deadline  start_date 	close_date r_creation_date $50; 
  infile datalines dlm=",";
  input deadline $ start_date $	close_date $ r_creation_date $;
datalines;
11/04/2016 14:53,29/02/2016 14:53,nulldate,	29/02/2016 14:53
11/04/2016 15:04,29/02/2016 15:04,30/03/2016 14:57,	29/02/2016 15:04
nulldate,nulldate,nulldate,1/03/2016 11:04
;
run;

data want;
  set have;
  if deadline ne "nulldate" then do;
    deadlined=input(scan(deadline,1," "),ddmmyy10.);
    deadlinet=input(scan(deadline,2," "),time5.);
    deadlinedt=input(put(deadlined,yymmdd10.)||"T"||put(deadlinet,tod5.)||":00",e8601dt.);
  end;
  format deadlinedt datetime. deadlined date9. deadlinet time5.;
run;

Yves_Boonen · Posted 04-14-2016 08:13 AM

I think I see your point. If I were to do this in the analysis query, the query itself would become lengthy, the query would become very unorderly and perhaps even slow to execute?

If I were to do this from the import, could I still do format $char16. and then use the proper informat? But is there a format that sticks to the used format? Datetime16. perhaps? But how will the import handle the nulldate values if I use that format?

RW9 · Posted 04-14-2016 08:24 AM

You should be able to use:

informat <your_variable> anydtdtm.;

In your import code, this should read it in as a date time. If the import value is missing, then it will be missing in the output variable. If you actually have the word "nulldate" present in your import file then you have problems, and that needs to be removed before importing - personally I wouldn't accept a file which had that kind of thing in it.

See SAS help:

http://support.sas.com/documentation/cdl/en/leforinforref/63324/HTML/default/viewer.htm#p1hsn1ji141r...

On your other point, yes, doing conversions in a query would take longer, the functions get called each row. Better to have data in a good strcuture before starting to process it.

Yves_Boonen · Posted 04-14-2016 08:27 AM

What if I were to do the unthinkable and alter my unload? Only alter it by changing "nulldate" with "-". If I am not mistaken, then the format you mentioned above should work, right?

RW9 · Posted 04-14-2016 08:32 AM

I would leave it as "", i.e. toally empty.

But then I would also have a data import agreement in place signed off by both parties, which detailed, as receiver of the data, the exact structure, file format, and contents of the data file as I want it to be (or near as). To not have any of this makes me a very sad importer 🐵

Yves_Boonen · Posted 04-14-2016 08:37 AM

If I do that in an import, all my cells get shifted. For example, a CSV-file:

(First row contains variable names.)

A;B;C;D;E

X;X;;Y;Z

X;X;Y;Y;Z

Will give me a table like:

A	B	C	D	E
X	X	Y	Z
X	X	Y	Y	Z

Even though you can clearly see that the it says "X;X;;Y;Z", it somehow ignores the ";;" and makes all my cells move one to the left. The unloads I have been using, have quite a few empty cells and so this became a problem. The quick and easy fix was to fill every empty cell with "-". So the CSV-file would look like:

A;B;C;D;E

X;X;-;Y;Z

X;X;Y;Y;Z

Will give me a table like this:

A	B	C	D	E
X	X	-	Y	Z
X	X	Y	Y	Z

That makes me wonder, is that normal? If so, would it be wise to replace nulldate with an empty cell?

RW9 · Posted 04-14-2016 09:08 AM

Change your infile "..." statement to include (either) DSD or missover option - can't think which it is, think its DSD:

infile "...\textfile.txt" dlm=";" dsd;

Kurt_Bremser · Posted 04-14-2016 09:14 AM

When this kind of shifting happens, it can usually be corrected by using the dsd option in the infile statement. This causes two delimiters in sequence to be interpreted as a missing value. Without dsd, the two delimiters are taken as one.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Yves_Boonen · Posted 04-14-2016 09:20 AM

That works wonders! Thanks!

However, I still have an error with my anydate.

Imagine just having a CSV-file with one column "dates" and in it a bunch of dates in the format "dd/mm/yy hh:mm". The code I have right now, looks like:

data dates;
infile 'C:\temp\dates.csv'  dlm=';' truncover firstobs=2 lrecl=500;

length
  date $16
;

format
  date $char16.
;

informat
  date anydtdtm16.
;

input
  date
;

keep
  date
;

run;

It results with an error in the log (from the original query and table):

1542    r_creation_date anydtdtm16.
                        -----------
                        48
ERROR 48-59: The informat $ANYDTDTM was not found or could not be loaded.

I guess I must be missing something somewhere? Do I need to adjust something in the format or even the input?

Kurt_Bremser · Posted 04-14-2016 09:28 AM

You try to assign a numerical format to a variable that was already defined as character. SAS then tries to correct your "obvious" mistake by searching for a proper character format (that would be named $anydtdtm), which does not exist.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Yves_Boonen · Posted 04-14-2016 09:46 AM

Oh, stupid me... I should remove the "$" in the length, shouldn't I?

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away