I greatly appreciate any and all patience. I am not sure I will be able to formulate my exact question the first time, but I will try.
I am trying to import a .csv file from another department's automated process (they have an automated extract that outputs the data to a .csv format).
Visually looking at the "DATE" field when opening the .csv file displays "01-SEP" (as an example), but clicking on an individual cell, I can see that the value for this particular cell is actually "2023-09-01". When trying to import via data step, I get only the displayed value of "01-SEP". I would like the year as well.
Here is what I tried:
DATA WORK.import_file;
INFILE "Import_Path\Import_file.csv" DELIMITER = ',' MISSOVER DSD LRECL=32767 FIRSTOBS=2;
LENGTH ASSIGNMENT_ID 8.;
INFORMAT ASSIGNMENT_ID 15.;
FORMAT ASSIGNMENT_ID 15.;
LENGTH DATE 8.;
INFORMAT DATE YYMMDD10.;
FORMAT DATE YYMMDD10.;
INPUT ASSIGNMENT_ID:15. DATE:ANYDTDTE.;
RUN;
However, the year part of the date does not import correctly.
For instance, the "2023-09-01" csv cell imports as "2001-09-01". Here the year is incorrect.
And the "2023-08-31" csv cell imports as "2031-08-01". Here the year and day are incorrect.
In general, I like the "2023-08-31" format, but obviously need the date to be correct.
Thank you in advance for any and all advice.
Summary of the problem:
Trying to read a CSV file generated by an automated process containing an ID and date field. When the CSV file was opened in Excel, the data displayed something like this:
ID | Date |
100 | 1-Sep |
100 | 2-Sep |
100 | 3-Sep |
If you click on the date in the first cell, Excel displays 9/1/2023 as the actual value
It turns out that Excel is interpreting the data, not showing that data as it actually resides in the raw file. This isn't helpful. Instead, as recommended by @ballardw and @Tom, we should open the CSV file in a text editor like Notepad++ (or even the SAS program editor) to see the raw data as it actually is:
"ID","Date" 100,"09-01" 100,"09-02" 100,"09-03"
We can see that the file only provides month and day values for the date. We know that this file was generated by the system in the year 2023. When we read the file in SAS, we'll use SCAN to extract the Month and Day values as text, INPUT to convert the text values to numeric, and finally MDY to produce a SAS date value.
/* Creat a sample CSV file */
filename haveCSV temp;
data _null_;
file haveCSV ;
length line $15;
if _n_=1 then do;
put '"ID","Date"';
end;
do ID=100 to 102;
do day=1 to 5;
line=catx(',',ID,'"'||catx('-','09',put(day,z2.))||'"');
put line ;
end;
end;
run;
/* View the raw CSV file contents in the Log*/
data _null_;
infile haveCSV ;
input;
put _infile_;
run;
/* Read the CSV file into SAS */
data want;
infile haveCSV dsd missover firstobs=2;
length Assignment_ID $3 Date 8;
format Date YYMMDD10.;
input Assignment_ID _txtDate:$15.;
Date=mdy(input(scan(_txtDate,1,'-'),32.)
,input(scan(_txtDate,2,'-'),32.)
,2023);
drop _:;
run;
Here's what the data looked like in the log:
"ID","Date" 100,"09-01" 100,"09-02" 100,"09-03" 100,"09-04" 100,"09-05" 101,"09-01" 101,"09-02" 101,"09-03" 101,"09-04" 101,"09-05" 102,"09-01" 102,"09-02" 102,"09-03" 102,"09-04" 102,"09-05"
And the resulting SAS dataset:
Obs | Assignment_ID | Date |
---|---|---|
1 | 100 | 2023-09-01 |
2 | 100 | 2023-09-02 |
3 | 100 | 2023-09-03 |
4 | 100 | 2023-09-04 |
5 | 100 | 2023-09-05 |
6 | 101 | 2023-09-01 |
7 | 101 | 2023-09-02 |
8 | 101 | 2023-09-03 |
9 | 101 | 2023-09-04 |
10 | 101 | 2023-09-05 |
11 | 102 | 2023-09-01 |
12 | 102 | 2023-09-02 |
13 | 102 | 2023-09-03 |
14 | 102 | 2023-09-04 |
15 | 102 | 2023-09-05 |
What does the date look like when the CSV file is opened in Notepad or another text editor, not Excel?
Can please tell us why you assign an Informat to DATE of yymmdd10 but have ANYDTDTE on the Input statement???
When you don't provide a width with Anydtdte then it defaults to 9. If the value is actually "2023-09-01" then the format stops reading at the 0 in 01 and does not generate a value at all. See these two examples:
data example; input date anydtdte.; format date yymmdd10.; datalines; 2023-09-01 ; data example2; input date anydtdte10.; format date yymmdd10.; datalines; 2023-09-01 ;
You may want to open the CSV file in notepad or even the SAS editor and copy the first 4 or 5 lines. Then open a text box on the forum using the </> icon and paste those lines. Then we have no question about the content.
Any CSV opened by spreadsheet software is subject to all sorts of "interpretation" by that software and will often reflect prejudices of the programmers and not the actual contents. Plus the behavior may well change on a cell by cell basis.
If you have defined a special INFORMAT to use with a variable there is no need to also specify an INFORMAT as part of the INPUT statement.
There is no need to specify an INFORMAT to read normal numbers or normal strings.
There is normally no need to attached formats to numbers or strings.
Your variables highlight two of the exceptions.
DATE (and TIME and DATETIME) values will normally need a special INFORMAT to be read and benefit from having a special FORMAT attached so the values print in a way that humans can understand.
And numbers with more than 8 or 12 significant digits will not display properly with the default BEST12. format used by most of SAS or the BEST8. format used by PROC SQL. So attaching a format like 15. to a number that you know has integers that large makes sense.
But it does raise the question of why you would read an ID value as a number. What arithmetic are you planning to do with an ID value? What would the MEAN() of an ID value mean?
It probably makes more sense to define that variable as character.
Try this:
DATA import_file;
INFILE "Import_Path\Import_file.csv" DSD TRUNCOVER FIRSTOBS=2;
LENGTH ASSIGNMENT_ID $15 DATE 8;
INPUT ASSIGNMENT_ID DATE;
INFORMAT DATE YYMMDD10.;
FORMAT DATE YYMMDD10.;
RUN;
I will move forward with the ID as a character. It is present in the data as a number. But no, I will not be doing calculation on this field.
Thank you all so much for the guidance.
It seems like Excel has interpretted the date. When opening in Notepad++, the date is truly 09-01. (Excel represents this as 01-SEP).
I know the year for this date is 2023, so when I saw that Excel had that present in the individual cells, I assumed the field was formatted oddly.
(I had another issue with ASSIGNMENT_ID - this is a field with a 15 number value, viewing in Excel showed "3E+14", while each cell showed the actual number. I was able to import this using SAS to get the actual number. I thought maybe the date was in a similar situation).
I do not work with .csv files very often, so I never thought to look at it in Notepad or ++.
Thank you all again!
No need to use some editor to LOOK at the file. Just tell SAS to show you what is there.
This will dump the first 10 lines to log. It will print a ruler so you can were where on the line the text appears. If any of the characters are non-printable it will display the hex code for the all of the characters in that line.
data _null_;
infile "myfile.csv" obs=10;
input;
list;
run;
Read the 09-01 as a character variable and parse it to create a date value.
data example; input mon_day $; date=mdy(input(scan(mon_day,1),2.),input(scan(mon_day,2),2.),2023); format date yymmdd10.; datalines; 09-01 ;
Scan to pull out the pieces (will use the - to separate by default), Input to convert the 09 to numeric 9. The MDY function converts number in month, day ,year to a SAS date value. This assumes ALL of your dates are 2023.
Caution: if you let spreadsheet software save the CSV file on closing it may change the actual values. This can happen if you do something like change the width of the cells so the spreadsheet thinks you "changed" something and will often ask if you want to save the changes. If you have done such already I suggest trying to recover the original file from the source if at all possible.
@Reeza wrote:
Fun thing once - Excel changed the order of YYMMDD to YYDDMM of a variable causing issues with the analyst trying to read the file. And it changed between different versions of the CSV.
Huh? I have never seen anybody use date digit strings in Year Day Month order. Did EXCEL really do that? Who would be the audience for that?
The original code you showed was using ANYDTDTE informat. That will have to GUESS what order the parts are in when the values make it ambiguous, for example when the day of the month is less than 13. The setting of the DATESTYLE option (and even the LOCALE option) can change which decision it makes.
And if EXCEL removed the century from the year then more mistakes are possible. What values does 01-02-03 represent?
Summary of the problem:
Trying to read a CSV file generated by an automated process containing an ID and date field. When the CSV file was opened in Excel, the data displayed something like this:
ID | Date |
100 | 1-Sep |
100 | 2-Sep |
100 | 3-Sep |
If you click on the date in the first cell, Excel displays 9/1/2023 as the actual value
It turns out that Excel is interpreting the data, not showing that data as it actually resides in the raw file. This isn't helpful. Instead, as recommended by @ballardw and @Tom, we should open the CSV file in a text editor like Notepad++ (or even the SAS program editor) to see the raw data as it actually is:
"ID","Date" 100,"09-01" 100,"09-02" 100,"09-03"
We can see that the file only provides month and day values for the date. We know that this file was generated by the system in the year 2023. When we read the file in SAS, we'll use SCAN to extract the Month and Day values as text, INPUT to convert the text values to numeric, and finally MDY to produce a SAS date value.
/* Creat a sample CSV file */
filename haveCSV temp;
data _null_;
file haveCSV ;
length line $15;
if _n_=1 then do;
put '"ID","Date"';
end;
do ID=100 to 102;
do day=1 to 5;
line=catx(',',ID,'"'||catx('-','09',put(day,z2.))||'"');
put line ;
end;
end;
run;
/* View the raw CSV file contents in the Log*/
data _null_;
infile haveCSV ;
input;
put _infile_;
run;
/* Read the CSV file into SAS */
data want;
infile haveCSV dsd missover firstobs=2;
length Assignment_ID $3 Date 8;
format Date YYMMDD10.;
input Assignment_ID _txtDate:$15.;
Date=mdy(input(scan(_txtDate,1,'-'),32.)
,input(scan(_txtDate,2,'-'),32.)
,2023);
drop _:;
run;
Here's what the data looked like in the log:
"ID","Date" 100,"09-01" 100,"09-02" 100,"09-03" 100,"09-04" 100,"09-05" 101,"09-01" 101,"09-02" 101,"09-03" 101,"09-04" 101,"09-05" 102,"09-01" 102,"09-02" 102,"09-03" 102,"09-04" 102,"09-05"
And the resulting SAS dataset:
Obs | Assignment_ID | Date |
---|---|---|
1 | 100 | 2023-09-01 |
2 | 100 | 2023-09-02 |
3 | 100 | 2023-09-03 |
4 | 100 | 2023-09-04 |
5 | 100 | 2023-09-05 |
6 | 101 | 2023-09-01 |
7 | 101 | 2023-09-02 |
8 | 101 | 2023-09-03 |
9 | 101 | 2023-09-04 |
10 | 101 | 2023-09-05 |
11 | 102 | 2023-09-01 |
12 | 102 | 2023-09-02 |
13 | 102 | 2023-09-03 |
14 | 102 | 2023-09-04 |
15 | 102 | 2023-09-05 |
Never let EXCEL just open a CSV file on its own. Always use the text file import utility (where ever they have moved it in your release of EXCEL) so you can tell how to interpret each column.
You were lucky this time that the 09-01 strings actually represented dates. If they had been part numbers or other codes converting them to dates would have destroyed the data.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.