I have a CSV-file containing dossier numbers, dossier titles, dossier creators and dates (deadline, start date, close date, creation date). When a dossier is created, it will contain 5 sub dossiers. The first dossier will become active and the other four will become waiting. Cancelled and waiting dossiers do not have any dates, except for a creation date. Instead of showing dates, these cells will contain the value "nulldate". To make the import (date / infile / length / format / informat / input / keep) easier, I decided to import every column as a textstring. The idea was to change the format of the dates once required, using the SUBSTR command. (Date format is dd/mm/yyyy hh:mm.)
Example:
deadline | start_date | close_date | r_creation_date |
11/04/2016 14:53 | 29/02/2016 14:53 | nulldate | 29/02/2016 14:53 |
11/04/2016 15:04 | 29/02/2016 15:04 | 30/03/2016 14:57 | 29/02/2016 15:04 |
nulldate | nulldate | nulldate | 1/03/2016 11:04 |
8/03/2016 11:24 | 1/03/2016 11:24 | 7/03/2016 15:52 | 1/03/2016 11:24 |
nulldate | nulldate | nulldate | 1/03/2016 11:24 |
25/04/2016 12:01 | 4/04/2016 12:01 | nulldate | 1/03/2016 11:24 |
15/03/2016 11:36 | 1/03/2016 11:35 | 1/03/2016 12:57 | 1/03/2016 11:35 |
15/03/2016 10:41 | 1/03/2016 10:41 | 1/03/2016 10:41 | 1/03/2016 10:40 |
19/04/2016 12:27 | 8/03/2016 12:27 | nulldate | 1/03/2016 10:55 |
8/03/2016 10:58 | 1/03/2016 10:58 | 7/03/2016 14:15 | 1/03/2016 10:58 |
nulldate | nulldate | nulldate | 1/03/2016 11:11 |
nulldate | nulldate | nulldate | 1/03/2016 11:14 |
15/04/2016 11:46 | 1/04/2016 11:46 | nulldate | 1/03/2016 11:15 |
8/03/2016 11:15 | 1/03/2016 11:15 | 1/04/2016 11:44 | 1/03/2016 11:15 |
nulldate | nulldate | nulldate | 1/03/2016 10:36 |
nulldate | nulldate | nulldate | 1/03/2016 11:35 |
nulldate | nulldate | nulldate | 1/03/2016 10:41 |
28/03/2016 9:56 | 14/03/2016 9:56 | nulldate | 1/03/2016 11:18 |
21/03/2016 14:21 | 29/02/2016 14:21 | 16/03/2016 10:52 | 29/02/2016 14:21 |
21/03/2016 14:34 | 29/02/2016 14:34 | 17/03/2016 12:04 | 29/02/2016 14:34 |
How would I go about using SUBSTR to change the format from $char16. to the actual format used through a datastep in my query (not the query I used for importing, but the query I used to analyse my data).
Additionally, later on I will need to compare some of these dates to a fixed date. This is because I am working with exports from a database and so the fixed date becomes the date of the export day. (%LET date=1APR2016;)
It doesn't matter which format it ends up using, as long as I can make the analysis. Any suggestions how I can start this?
Since r_creation_date is a timestamp, don't include it in the length statement at all. Timestamps are best left at the default length of 8 (numeric).
To clarify:
I have to check if the dossiers were handled before the deadline. For dossiers that are completed, a close date is available. The calculation will become like this:
when ('DONE') do;
done+1;
if close_date > deadline then done_overtime+1;
end;
However, for dossiers that are still active, there will be no close date available. In that case I will need to compare it to another date instead. Since the export I am currently using, was created on the 1st of April, this fixed date has to be April 1st, 2016. The calculation will then become like this:
when ('ACTIVE') do;
active+1;
if date() > deadline then active_overtime+1;
end;
Naturally this calculation won't work, since SAS thinks the dates are a textstring and not dates. Importing the dates as textstrings is a necessity because not only does it actually contain a textstring ("nulldate"), I need to import the data from the CSV-file untouched. A colelague pointed out that the best approach would be to use the import query for importing everything as a textstring and then later on use the SUBSTR command to change the format in the datastep from your actual analysis.
Hands down, the best method to deal with these problems is to store dates and timestamps as intended, and use formats for display or output as necessary; all comparisons and mathematical operations become a breeze instead of a massive PITA.
So the way to go in your situation is to
- read the column as character
- check if it contains "nulldate"
- if yes, set (numerical) date/timestamp variable to missing (.)
- else convert with proper format
- drop temporary character variable
If you want to preserve the original unconverted data, add a sufficiently large character variable in the data set and fill it with the contents of _infile_.
What you are suggesting makes sense. If I understand correctly, I would have to do this during the import, don't I? Just in case I will show you the import query I used.
data unload.task;
infile '\\networklocation\unloads\JNX\JNX_unload_tasks_ 201603_.csv' dlm=';' truncover firstobs=2 lrecl=500;
length
r_object_id $16
object_name $70
deadline $16
start_date $16
close_date $16
task_handler $25
task_state $9
dossier_nr $6
sequence $1
r_creation_date $16
;
format
r_object_id $char16.
object_name $char70.
deadline $char16.
start_date $char16.
close_date $char16.
task_handler $char25.
task_state $char9.
dossier_nr $char6.
sequence $char1.
r_creation_date $char16.
;
informat
r_object_id $char16.
object_name $char70.
deadline $char16.
start_date $char16.
close_date $char16.
task_handler $char25.
task_state $char9.
dossier_nr $char6.
sequence $char1.
r_creation_date $char16.
;
input
r_object_id
object_name
deadline
start_date
close_date
task_handler
task_state
dossier_nr
sequence
r_creation_date
;
keep
r_object_id
object_name
deadline
start_date
close_date
task_handler
task_state
dossier_nr
sequence
r_creation_date
;
run;
This brings us back to the format of the dates. The format itself is should be date10., but that would be without the "hh:mm". I thought the SUBSTR could manage that by saying something like: "new_deadline=SUBSTR(deadline,1,10)", but I can't get that to work at all.
You would use inputs and puts to arrange your data after the fact - not sure why you don't just update this on the read in? Here is an example of creating date, time, and datetime variables:
data have; length deadline start_date close_date r_creation_date $50; infile datalines dlm=","; input deadline $ start_date $ close_date $ r_creation_date $; datalines; 11/04/2016 14:53,29/02/2016 14:53,nulldate, 29/02/2016 14:53 11/04/2016 15:04,29/02/2016 15:04,30/03/2016 14:57, 29/02/2016 15:04 nulldate,nulldate,nulldate,1/03/2016 11:04 ; run; data want; set have; if deadline ne "nulldate" then do; deadlined=input(scan(deadline,1," "),ddmmyy10.); deadlinet=input(scan(deadline,2," "),time5.); deadlinedt=input(put(deadlined,yymmdd10.)||"T"||put(deadlinet,tod5.)||":00",e8601dt.); end; format deadlinedt datetime. deadlined date9. deadlinet time5.; run;
I think I see your point. If I were to do this in the analysis query, the query itself would become lengthy, the query would become very unorderly and perhaps even slow to execute?
If I were to do this from the import, could I still do format $char16. and then use the proper informat? But is there a format that sticks to the used format? Datetime16. perhaps? But how will the import handle the nulldate values if I use that format?
You should be able to use:
informat <your_variable> anydtdtm.;
In your import code, this should read it in as a date time. If the import value is missing, then it will be missing in the output variable. If you actually have the word "nulldate" present in your import file then you have problems, and that needs to be removed before importing - personally I wouldn't accept a file which had that kind of thing in it.
See SAS help:
On your other point, yes, doing conversions in a query would take longer, the functions get called each row. Better to have data in a good strcuture before starting to process it.
What if I were to do the unthinkable and alter my unload? Only alter it by changing "nulldate" with "-". If I am not mistaken, then the format you mentioned above should work, right?
I would leave it as "", i.e. toally empty.
But then I would also have a data import agreement in place signed off by both parties, which detailed, as receiver of the data, the exact structure, file format, and contents of the data file as I want it to be (or near as). To not have any of this makes me a very sad importer 🐵
If I do that in an import, all my cells get shifted. For example, a CSV-file:
(First row contains variable names.)
A;B;C;D;E
X;X;;Y;Z
X;X;Y;Y;Z
Will give me a table like:
A | B | C | D | E |
X | X | Y | Z | |
X | X | Y | Y | Z |
Even though you can clearly see that the it says "X;X;;Y;Z", it somehow ignores the ";;" and makes all my cells move one to the left. The unloads I have been using, have quite a few empty cells and so this became a problem. The quick and easy fix was to fill every empty cell with "-". So the CSV-file would look like:
A;B;C;D;E
X;X;-;Y;Z
X;X;Y;Y;Z
Will give me a table like this:
A | B | C | D | E |
X | X | - | Y | Z |
X | X | Y | Y | Z |
That makes me wonder, is that normal? If so, would it be wise to replace nulldate with an empty cell?
Change your infile "..." statement to include (either) DSD or missover option - can't think which it is, think its DSD:
infile "...\textfile.txt" dlm=";" dsd;
When this kind of shifting happens, it can usually be corrected by using the dsd option in the infile statement. This causes two delimiters in sequence to be interpreted as a missing value. Without dsd, the two delimiters are taken as one.
That works wonders! Thanks!
However, I still have an error with my anydate.
Imagine just having a CSV-file with one column "dates" and in it a bunch of dates in the format "dd/mm/yy hh:mm". The code I have right now, looks like:
data dates;
infile 'C:\temp\dates.csv' dlm=';' truncover firstobs=2 lrecl=500;
length
date $16
;
format
date $char16.
;
informat
date anydtdtm16.
;
input
date
;
keep
date
;
run;
It results with an error in the log (from the original query and table):
1542 r_creation_date anydtdtm16.
-----------
48
ERROR 48-59: The informat $ANYDTDTM was not found or could not be loaded.
I guess I must be missing something somewhere? Do I need to adjust something in the format or even the input?
You try to assign a numerical format to a variable that was already defined as character. SAS then tries to correct your "obvious" mistake by searching for a proper character format (that would be named $anydtdtm), which does not exist.
Oh, stupid me... I should remove the "$" in the length, shouldn't I?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.