04-14-2016 07:32 AM - edited 04-14-2016 07:46 AM
I have a CSV-file containing dossier numbers, dossier titles, dossier creators and dates (deadline, start date, close date, creation date). When a dossier is created, it will contain 5 sub dossiers. The first dossier will become active and the other four will become waiting. Cancelled and waiting dossiers do not have any dates, except for a creation date. Instead of showing dates, these cells will contain the value "nulldate". To make the import (date / infile / length / format / informat / input / keep) easier, I decided to import every column as a textstring. The idea was to change the format of the dates once required, using the SUBSTR command. (Date format is dd/mm/yyyy hh:mm.)
|11/04/2016 14:53||29/02/2016 14:53||nulldate||29/02/2016 14:53|
|11/04/2016 15:04||29/02/2016 15:04||30/03/2016 14:57||29/02/2016 15:04|
|8/03/2016 11:24||1/03/2016 11:24||7/03/2016 15:52||1/03/2016 11:24|
|25/04/2016 12:01||4/04/2016 12:01||nulldate||1/03/2016 11:24|
|15/03/2016 11:36||1/03/2016 11:35||1/03/2016 12:57||1/03/2016 11:35|
|15/03/2016 10:41||1/03/2016 10:41||1/03/2016 10:41||1/03/2016 10:40|
|19/04/2016 12:27||8/03/2016 12:27||nulldate||1/03/2016 10:55|
|8/03/2016 10:58||1/03/2016 10:58||7/03/2016 14:15||1/03/2016 10:58|
|15/04/2016 11:46||1/04/2016 11:46||nulldate||1/03/2016 11:15|
|8/03/2016 11:15||1/03/2016 11:15||1/04/2016 11:44||1/03/2016 11:15|
|28/03/2016 9:56||14/03/2016 9:56||nulldate||1/03/2016 11:18|
|21/03/2016 14:21||29/02/2016 14:21||16/03/2016 10:52||29/02/2016 14:21|
|21/03/2016 14:34||29/02/2016 14:34||17/03/2016 12:04||29/02/2016 14:34|
How would I go about using SUBSTR to change the format from $char16. to the actual format used through a datastep in my query (not the query I used for importing, but the query I used to analyse my data).
Additionally, later on I will need to compare some of these dates to a fixed date. This is because I am working with exports from a database and so the fixed date becomes the date of the export day. (%LET date=1APR2016;)
It doesn't matter which format it ends up using, as long as I can make the analysis. Any suggestions how I can start this?
04-14-2016 09:53 AM
Since r_creation_date is a timestamp, don't include it in the length statement at all. Timestamps are best left at the default length of 8 (numeric).
04-14-2016 07:44 AM
I have to check if the dossiers were handled before the deadline. For dossiers that are completed, a close date is available. The calculation will become like this:
when ('DONE') do; done+1; if close_date > deadline then done_overtime+1; end;
However, for dossiers that are still active, there will be no close date available. In that case I will need to compare it to another date instead. Since the export I am currently using, was created on the 1st of April, this fixed date has to be April 1st, 2016. The calculation will then become like this:
when ('ACTIVE') do; active+1; if date() > deadline then active_overtime+1; end;
Naturally this calculation won't work, since SAS thinks the dates are a textstring and not dates. Importing the dates as textstrings is a necessity because not only does it actually contain a textstring ("nulldate"), I need to import the data from the CSV-file untouched. A colelague pointed out that the best approach would be to use the import query for importing everything as a textstring and then later on use the SUBSTR command to change the format in the datastep from your actual analysis.
04-14-2016 07:53 AM
Hands down, the best method to deal with these problems is to store dates and timestamps as intended, and use formats for display or output as necessary; all comparisons and mathematical operations become a breeze instead of a massive PITA.
So the way to go in your situation is to
- read the column as character
- check if it contains "nulldate"
- if yes, set (numerical) date/timestamp variable to missing (.)
- else convert with proper format
- drop temporary character variable
If you want to preserve the original unconverted data, add a sufficiently large character variable in the data set and fill it with the contents of _infile_.
04-14-2016 08:02 AM
What you are suggesting makes sense. If I understand correctly, I would have to do this during the import, don't I? Just in case I will show you the import query I used.
data unload.task; infile '\\networklocation\unloads\JNX\JNX_unload_tasks_ 201603_.csv' dlm=';' truncover firstobs=2 lrecl=500; length r_object_id $16 object_name $70 deadline $16 start_date $16 close_date $16 task_handler $25 task_state $9 dossier_nr $6 sequence $1 r_creation_date $16 ; format r_object_id $char16. object_name $char70. deadline $char16. start_date $char16. close_date $char16. task_handler $char25. task_state $char9. dossier_nr $char6. sequence $char1. r_creation_date $char16. ; informat r_object_id $char16. object_name $char70. deadline $char16. start_date $char16. close_date $char16. task_handler $char25. task_state $char9. dossier_nr $char6. sequence $char1. r_creation_date $char16. ; input r_object_id object_name deadline start_date close_date task_handler task_state dossier_nr sequence r_creation_date ; keep r_object_id object_name deadline start_date close_date task_handler task_state dossier_nr sequence r_creation_date ; run;
This brings us back to the format of the dates. The format itself is should be date10., but that would be without the "hh:mm". I thought the SUBSTR could manage that by saying something like: "new_deadline=SUBSTR(deadline,1,10)", but I can't get that to work at all.
04-14-2016 08:02 AM
You would use inputs and puts to arrange your data after the fact - not sure why you don't just update this on the read in? Here is an example of creating date, time, and datetime variables:
data have; length deadline start_date close_date r_creation_date $50; infile datalines dlm=","; input deadline $ start_date $ close_date $ r_creation_date $; datalines; 11/04/2016 14:53,29/02/2016 14:53,nulldate, 29/02/2016 14:53 11/04/2016 15:04,29/02/2016 15:04,30/03/2016 14:57, 29/02/2016 15:04 nulldate,nulldate,nulldate,1/03/2016 11:04 ; run; data want; set have; if deadline ne "nulldate" then do; deadlined=input(scan(deadline,1," "),ddmmyy10.); deadlinet=input(scan(deadline,2," "),time5.); deadlinedt=input(put(deadlined,yymmdd10.)||"T"||put(deadlinet,tod5.)||":00",e8601dt.); end; format deadlinedt datetime. deadlined date9. deadlinet time5.; run;
04-14-2016 08:13 AM - edited 04-14-2016 08:14 AM
I think I see your point. If I were to do this in the analysis query, the query itself would become lengthy, the query would become very unorderly and perhaps even slow to execute?
If I were to do this from the import, could I still do format $char16. and then use the proper informat? But is there a format that sticks to the used format? Datetime16. perhaps? But how will the import handle the nulldate values if I use that format?
04-14-2016 08:24 AM
You should be able to use:
informat <your_variable> anydtdtm.;
In your import code, this should read it in as a date time. If the import value is missing, then it will be missing in the output variable. If you actually have the word "nulldate" present in your import file then you have problems, and that needs to be removed before importing - personally I wouldn't accept a file which had that kind of thing in it.
See SAS help:
On your other point, yes, doing conversions in a query would take longer, the functions get called each row. Better to have data in a good strcuture before starting to process it.
04-14-2016 08:27 AM
What if I were to do the unthinkable and alter my unload? Only alter it by changing "nulldate" with "-". If I am not mistaken, then the format you mentioned above should work, right?
04-14-2016 08:32 AM
I would leave it as "", i.e. toally empty.
But then I would also have a data import agreement in place signed off by both parties, which detailed, as receiver of the data, the exact structure, file format, and contents of the data file as I want it to be (or near as). To not have any of this makes me a very sad importer )
04-14-2016 08:37 AM - edited 04-14-2016 08:40 AM
If I do that in an import, all my cells get shifted. For example, a CSV-file:
(First row contains variable names.)
Will give me a table like:
Even though you can clearly see that the it says "X;X;;Y;Z", it somehow ignores the ";;" and makes all my cells move one to the left. The unloads I have been using, have quite a few empty cells and so this became a problem. The quick and easy fix was to fill every empty cell with "-". So the CSV-file would look like:
Will give me a table like this:
That makes me wonder, is that normal? If so, would it be wise to replace nulldate with an empty cell?
04-14-2016 09:08 AM
Change your infile "..." statement to include (either) DSD or missover option - can't think which it is, think its DSD:
infile "...\textfile.txt" dlm=";" dsd;
04-14-2016 09:14 AM
When this kind of shifting happens, it can usually be corrected by using the dsd option in the infile statement. This causes two delimiters in sequence to be interpreted as a missing value. Without dsd, the two delimiters are taken as one.
04-14-2016 09:20 AM
That works wonders! Thanks!
However, I still have an error with my anydate.
Imagine just having a CSV-file with one column "dates" and in it a bunch of dates in the format "dd/mm/yy hh:mm". The code I have right now, looks like:
data dates; infile 'C:\temp\dates.csv' dlm=';' truncover firstobs=2 lrecl=500; length date $16 ; format date $char16. ; informat date anydtdtm16. ; input date ; keep date ; run;
It results with an error in the log (from the original query and table):
1542 r_creation_date anydtdtm16. ----------- 48 ERROR 48-59: The informat $ANYDTDTM was not found or could not be loaded.
I guess I must be missing something somewhere? Do I need to adjust something in the format or even the input?
04-14-2016 09:28 AM
You try to assign a numerical format to a variable that was already defined as character. SAS then tries to correct your "obvious" mistake by searching for a proper character format (that would be named $anydtdtm), which does not exist.
04-14-2016 09:44 AM
Makes sense. My idea to solve this issue would be to change the format in the format section. When I try that, it results in the same error. I will just post the full query and the full log.
data unload.task; infile 'C:\temp\JNX_unload_tasks_ 201603_.csv' dlm=';' truncover firstobs=2 lrecl=500 dsd; length r_object_id $16 object_name $70 deadline $16 start_date $16 close_date $16 task_handler $25 task_state $9 dossier_nr $6 sequence $1 r_creation_date $16 ; format r_object_id $char16. object_name $char70. deadline anydtdtm16. start_date anydtdtm16. close_date anydtdtm16. task_handler $char25. task_state $char9. dossier_nr $char6. sequence $char1. r_creation_date anydtdtm16. ; informat r_object_id $char16. object_name $char70. deadline anydtdtm16. start_date anydtdtm16. close_date anydtdtm16. task_handler $char25. task_state $char9. dossier_nr $char6. sequence $char1. r_creation_date anydtdtm16. ; input r_object_id object_name deadline start_date close_date task_handler task_state dossier_nr sequence r_creation_date ; keep r_object_id object_name deadline start_date close_date task_handler task_state dossier_nr sequence r_creation_date ; run;
2543 data unload.task; 2544 infile 2544! '\\GLOW001.AD.SYS\0005_data\diensten\Bvh_h000\Dbe_h001\Rpb_0396\Prc_07ng\Plo_0425\Acn_02pp\Z 2544! cn_04io\1_Data\Interne_Werking\2_Operationeel\5_procescontrole\Gegevens\JNX\MAAND\CIJFERS\At 2544! ropos\unloads\JNX\JNX_unload_tasks_ 201603_.csv' dlm=';' truncover firstobs=2 lrecl=500 dsd; 2545 2546 /* De lengte werd exact bepaald m.b.v. de functie =LEN(celnaam) in Excel. */ 2547 2548 length 2549 r_object_id $16 2550 object_name $70 2551 deadline $16 2552 start_date $16 2553 close_date $16 2554 task_handler $25 2555 task_state $9 2556 dossier_nr $6 2557 sequence $1 2558 r_creation_date $16 2559 ; 2560 2561 /* Alle variabelen als tekst inlezen, behalve de datums. */ 2562 2563 format 2564 r_object_id $char16. 2565 object_name $char70. 2566 deadline anydtdtm16. ----------- 48 2567 start_date anydtdtm16. ----------- 48 2568 close_date anydtdtm16. ----------- 48 ERROR 48-59: The format $ANYDTDTM was not found or could not be loaded. 2569 task_handler $char25. 2570 task_state $char9. 2571 dossier_nr $char6. 2572 sequence $char1. 2573 r_creation_date anydtdtm16. ----------- 48 ERROR 48-59: The format $ANYDTDTM was not found or could not be loaded. 2574 ; 2575 2576 /* Alle variabelen als tekst uitsturen, behalve de datums. */ 2577 2578 informat 2579 r_object_id $char16. 2580 object_name $char70. 2581 deadline anydtdtm16. ----------- 48 2582 start_date anydtdtm16. ----------- 48 2583 close_date anydtdtm16. ----------- 48 ERROR 48-59: The informat $ANYDTDTM was not found or could not be loaded. 2584 task_handler $char25. 2585 task_state $char9. 2586 dossier_nr $char6. 2587 sequence $char1. 2588 r_creation_date anydtdtm16. ----------- 48 ERROR 48-59: The informat $ANYDTDTM was not found or could not be loaded. 2589 ; 2590 2591 input 2592 r_object_id 2593 object_name 2594 deadline 2595 start_date 2596 close_date 2597 task_handler 2598 task_state 2599 dossier_nr 2600 sequence 2601 r_creation_date 2602 ; 2603 2604 /* Volledige selectie maken. */ 2605 2606 keep 2607 r_object_id 2608 object_name 2609 deadline 2610 start_date 2611 close_date 2612 task_handler 2613 task_state 2614 dossier_nr 2615 sequence 2616 r_creation_date 2617 ; 2618 2619 run; NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set UNLOAD.TASK may be incomplete. When this step was stopped there were 0 observations and 10 variables. WARNING: Data set UNLOAD.TASK was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 0.46 seconds cpu time 0.09 seconds
It seems as though SAS doesn't "know" this format? Could that be the case?
Need further help from the community? Please ask a new question.