Solved: Re: Month/day switched when importing cvs file with datetime variable

JKHess · Posted 11-28-2023 10:46 AM

Hi, when importing a large csv file with several datetime variables, there is a discrepancy between the value SAS assigns vs the value Microsoft products (excel, power BI) assign, specifically, the month and day are switched for some but not all of the dates (e.g., Jan 4 in SAS, April 1 in MS). Has anyone run into this? How can I determine which is correct? Note, the informat statement defaulted to anydtdtm40., but I also tried anydtdte40. and got the same result.

Code below:
PROC IMPORT OUT= WORK.z
DATAFILE= "C:\filename.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

data z2;
set z;
i_date=datepart(i_datetime);
format i_date mmddyy8.;

Tom · Posted 11-28-2023 02:34 PM

So the CSV export from that site messes up the datetime fields.

But if you use the API interface it generates the datetime fields in ISO standard format.

So copy the file yourself to a CSV file and it should be easier to read it.

data _null_;
  infile  "https://data.cityofnewyork.us/resource/76xm-jjuj.csv?$limit=30000000"  url ;
  file 'nycems.csv';
  input;
  put _infile_;
run;

View solution in original post

PaigeMiller · Posted 11-28-2023 11:17 AM

PROC IMPORT is a guessing procedure. In this case, it guessed wrong. However, the actual code it uses to import the CSV file is provided in the log, you can copy this code, modify it to produce the correct dates, and run it again.

--
Paige Miller

JKHess · Posted 11-28-2023 11:28 AM

Yes, I did modify the code in the log file when I changed the informat from anydtdtm40. to anydtdte40. Not sure what else to try.

It's not reading all the dates wrong (or differently from MS), just some of them.

antonbcristina · Posted 11-28-2023 11:29 AM

Would you be able to provide examples of the data you have?

PaigeMiller · Posted 11-28-2023 11:39 AM

@JKHess wrote:
Yes, I did modify the code in the log file when I changed the informat from anydtdtm40. to anydtdte40. Not sure what else to try.

It's not reading all the dates wrong (or differently from MS), just some of them.

These informats are not the ones you want to use.

You will need to identify the specific informats that will allow you to read the dates properly. It may be that you want to switch from the MMDDYY informat to the DDMMYY informat, or vice versa. Providing examples of the dates in the CSV file would be very helpful.

--
Paige Miller

Quentin · Posted 11-28-2023 01:00 PM

As @PaigeMiller said, PROC IMPORT is guessing as to the meaning of values it reads from the CSV. If you use Excel to read a CSV, it also will guess, and I suspect PowerBI will guess as well. When you use the anydt* informats, you are telling SAS to guess the meaning, because you don't know if the date could be mm/dd/yyyy or yyyy/mm/dd or one of many other possible formats.

If PROC IMPORT, Excel, and PowerBI are making different guesses as to the meaning of a date value stored in a CSV, there is no good way to know which guess is correct. You would need to ask the source of the CSV about the meaning of each value.

Hopefully the source will tell you that there is a consistent date format used for a variable. Then in SAS, you would specify the appropriate specific informat when you read in the data. The goal is to avoid guessing, which can always go wrong.

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.

JKHess · Posted 11-28-2023 11:38 AM

It's this file: https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj

It's too big for me to open it with a text editor.

Kurt_Bremser · Posted 11-28-2023 12:04 PM

@JKHess wrote:
It's this file: https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj

It's too big for me to open it with a text editor.

Then you use the wrong editor. Something like Notepad++ should not be overwhelmed by size.

Alternatively, this SAS method can also help:

data _null_;
infile "C:\filename.csv" lrecl=32767 obs=10;
input;
list;
run;

which shows the contents in the log, or this one:

data test;
infile "C:\filename.csv" lrecl=32767 obs=10;
length line $32767;
input;
line = _infile_;
run;

which puts the lines into a dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 11-28-2023 12:29 PM

If the example shown is correct then the data appears in YYYY MON DD HH:MM:SS am/pm structure. Which is nearly one of the stupidest data storage formats I have seen in a long time. (MON means three letter month abbreviation)

This is one way to parse the values. It would require reading all the values as strings, then using an array to loop over them to process and assign the values to another array for the datetime value.

data junk;
  string = '2023 Oct 01 12:15:23 AM';
  dt = dhms(input(cats(scan(string,3),scan(string,2),scan(string,1)),date9.),0,0,input(cats(scan(string,4),scan(string,4)),time10.));
  format dt datetime20.;
run;

Tom · Posted 11-28-2023 02:34 PM

So the CSV export from that site messes up the datetime fields.

But if you use the API interface it generates the datetime fields in ISO standard format.

So copy the file yourself to a CSV file and it should be easier to read it.

data _null_;
  infile  "https://data.cityofnewyork.us/resource/76xm-jjuj.csv?$limit=30000000"  url ;
  file 'nycems.csv';
  input;
  put _infile_;
run;

Tom · Posted 11-28-2023 03:38 PM

Once you have a valid CSV file you can use %CSV2DS() to help you GUESS how to read it.

You will need to use the PRECENT= option to sample a subset of the rows since over 25 million observations will overwhelm PROC SQL's ability to summarize the data types.

%csv2ds('c:\downloads\nycems.csv',out=nycems,percent=1,replace=1)

Resulting SAS code:

data nycems;
  infile 'c:\downloads\nycems.csv' dlm=',' dsd truncover firstobs=2 ;
  length cad_incident_id 8 incident_datetime 8 initial_call_type $6
    initial_severity_level_code 8 final_call_type $6
    final_severity_level_code 8 first_assignment_datetime 8
    valid_dispatch_rspns_time_indc $5 dispatch_response_seconds_qy 8
    first_activation_datetime 8 first_on_scene_datetime 8
    valid_incident_rspns_time_indc $5 incident_response_seconds_qy 8
    incident_travel_tm_seconds_qy 8 first_to_hosp_datetime 8
    first_hosp_arrival_datetime 8 incident_close_datetime 8
    held_indicator $5 incident_disposition_code 8 borough $24
    incident_dispatch_area $2 zipcode 8 policeprecinct 8
    citycouncildistrict 8 communitydistrict 8 communityschooldistrict 8
    congressionaldistrict 8 reopen_indicator $5 special_event_indicator $5
    standby_indicator $5 transfer_indicator $5
  ;
  informat incident_datetime anydtdtm. first_assignment_datetime anydtdtm.
    first_activation_datetime anydtdtm. first_on_scene_datetime anydtdtm.
    first_to_hosp_datetime anydtdtm. first_hosp_arrival_datetime anydtdtm.
    incident_close_datetime anydtdtm. zipcode comma.
  ;
  format incident_datetime datetime19.
    first_assignment_datetime datetime19.
    first_activation_datetime datetime19.
    first_on_scene_datetime datetime19. first_to_hosp_datetime datetime19.
    first_hosp_arrival_datetime datetime19.
    incident_close_datetime datetime19. zipcode comma6.
  ;
  input cad_incident_id -- transfer_indicator ;
run;

You should probably remove the FORMAT from ZIPCODE since you don't want to replicate the mistake in the data and print the values with thousands separators.

You also will probably want to standardize the "indicator" variables to consistently use Y/N instead of sometimes using true/false.

Tom · Posted 11-28-2023 04:06 PM

Example:

proc format ;
invalue $yntf 'Y','true'='Y' 'N','false'='N';
run;


data nycems;
  infile  "c:\downloads\nycems.csv"  dlm=',' dsd truncover firstobs=2 ;
  length cad_incident_id 8 incident_datetime 8 initial_call_type $6
    initial_severity_level_code 8 final_call_type $6
    final_severity_level_code 8 first_assignment_datetime 8
    valid_dispatch_rspns_time_indc $1 dispatch_response_seconds_qy 8
    first_activation_datetime 8 first_on_scene_datetime 8
    valid_incident_rspns_time_indc $1 incident_response_seconds_qy 8
    incident_travel_tm_seconds_qy 8 first_to_hosp_datetime 8
    first_hosp_arrival_datetime 8 incident_close_datetime 8
    held_indicator $1 incident_disposition_code 8 borough $24
    incident_dispatch_area $2 zipcode 8 policeprecinct 8
    citycouncildistrict 8 communitydistrict 8 communityschooldistrict 8
    congressionaldistrict 8 reopen_indicator $1 special_event_indicator $1
    standby_indicator $1 transfer_indicator $1
  ;
  informat incident_datetime first_assignment_datetime 
    first_activation_datetime first_on_scene_datetime 
    first_to_hosp_datetime  first_hosp_arrival_datetime 
    incident_close_datetime anydtdtm.
    zipcode comma.
    valid_dispatch_rspns_time_indc valid_incident_rspns_time_indc  
    held_indicator reopen_indicator special_event_indicator standby_indicator 
    transfer_indicator $yntf.
  ;
  format incident_datetime 
    first_assignment_datetime 
    first_activation_datetime 
    first_on_scene_datetime  first_to_hosp_datetime 
    first_hosp_arrival_datetime 
    incident_close_datetime datetime19.
  ;
  input cad_incident_id -- transfer_indicator ;
run;

Reeza · Posted 11-28-2023 04:14 PM

Modifying @Tom 's code very slightly to read from the URL directly.

filename src url 'https://data.cityofnewyork.us/resource/76xm-jjuj.csv';

proc format ;
invalue $yntf 'Y','true'='Y' 'N','false'='N';
run;


data nycems;
  infile  src  dlm=',' dsd truncover firstobs=2 ;
  length cad_incident_id 8 incident_datetime 8 initial_call_type $6
    initial_severity_level_code 8 final_call_type $6
    final_severity_level_code 8 first_assignment_datetime 8
    valid_dispatch_rspns_time_indc $1 dispatch_response_seconds_qy 8
    first_activation_datetime 8 first_on_scene_datetime 8
    valid_incident_rspns_time_indc $1 incident_response_seconds_qy 8
    incident_travel_tm_seconds_qy 8 first_to_hosp_datetime 8
    first_hosp_arrival_datetime 8 incident_close_datetime 8
    held_indicator $1 incident_disposition_code 8 borough $24
    incident_dispatch_area $2 zipcode 8 policeprecinct 8
    citycouncildistrict 8 communitydistrict 8 communityschooldistrict 8
    congressionaldistrict 8 reopen_indicator $1 special_event_indicator $1
    standby_indicator $1 transfer_indicator $1
  ;
  informat incident_datetime first_assignment_datetime 
    first_activation_datetime first_on_scene_datetime 
    first_to_hosp_datetime  first_hosp_arrival_datetime 
    incident_close_datetime anydtdtm.
    zipcode comma.
    valid_dispatch_rspns_time_indc valid_incident_rspns_time_indc  
    held_indicator reopen_indicator special_event_indicator standby_indicator 
    transfer_indicator $yntf.
  ;
  format incident_datetime 
    first_assignment_datetime 
    first_activation_datetime 
    first_on_scene_datetime  first_to_hosp_datetime 
    first_hosp_arrival_datetime 
    incident_close_datetime datetime19.
  ;
  input cad_incident_id -- transfer_indicator ;
run;

ballardw · Posted 11-28-2023 06:07 PM

Modifying @Reeza and @Tom suggest code just a bit:

Custom informat:

proc format ;
invalue yntf (upcase)
'Y','YES','T','TRUE'=1
'N','NO','F','FALSE'=0
;
run;

And use Yntf (not character $Yntf ) informat. The counting of yes/no/true/false values quite often goes much easier using numeric as then Proc means/summary or the report procedures Report and Tabulate can get counts of Yes (sum statistic) percent as decimal value (mean statistic) and it is easier if you need to get something like that from multiple variables on a record.

Note that I added a couple of guessed other values that might appear just in case in the informat YES, NO, T, F and the upcase option will deal with possible changes in case down the line.

JKHess · Posted 12-03-2023 12:21 AM

@Tom Is there additional code needed to run this macro? I'm getting a warning and an error. I think I need to identify where the macro code is found?

%csv2ds("c:\Users\judyk\nycems.csv",out=nycems,percent=1,replace=1);
-
180 WARNING: Apparent invocation of macro CSV2DS not resolved.

ERROR 180-322: Statement is not valid or it is used out of proper order.

SAS Innovate 2025: Register Today!

SAS Training: Just a Click Away