- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
We encounter the following problem when trying to read in a simple CSV file which on the first row has an incomplete date (only the year) and on the second row a complete date. Here is a simplified version of the CSV file that can be used to demonstrate the issue:
text,start,end
abc,1968,1989
def,1981,2/14/2011
When we try to read this in with proc import using guessingrows=3, or higher, like this:
proc import out=test
datafile='<path>/test.csv'
dbms=csv replace;
guessingrows=3;
getnames=yes;
run;
then we get an error for the first data row because proc import chooses to use: "informat end mmddyy10.;", and "1989" does not follow that format.
If we use guessingrows=2; then it correctly notices that the first data row is not a date and it uses "informat end best12.;" (which then of course gives, as expected, an error for the second data row).
What should have happened I would think, is that with "guessingrows" set to 3 or higher it should have seen that it needs to use neither a numeric nor a date format but a character one like "informat end $9.;". It appears that this is a bug in proc import.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
UPDATED:
The text you posted that I copied had an string that could not be a date, that is why the code below worked.
You should report this to SAS as a BUG.
https://support.sas.com/en/technical-support/submit-a-support-request.html
That seems wrong. I cannot re-create the issue you are seeing.
What version of SAS are you using?
Is there something else going on with that file?
This works on SAS 9.4M5 and the 9.4m6 version available in SAS ODA.
filename csv temp;
data _null_;
file csv;
put 'date'/'1989'/'22/14/2011';
run;
proc import dbms=csv file=csv out=want replace;
run;
proc print;
run;
36 /********************************************************************** 637 * PRODUCT: SAS 638 * VERSION: 9.4 639 * CREATOR: External File Interface 640 * DATE: 17AUG22 641 * DESC: Generated SAS Datastep Code 642 * TEMPLATE SOURCE: (None Specified.) 643 ***********************************************************************/ 644 data WORK.WANT ; 645 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ 646 infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ; 647 informat date $10. ; 648 format date $10. ; 649 input 650 date $ 651 ; 652 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ 653 run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
PROC IMPORT is a guessing procedure ... if makes "reasonable" guesses as to how to read the file ("reasonable" to the people who programmmed it). It can not correctly handle EVERY situation that might ever come up. So is this a bug? I don't know, I guess that's a matter of opinon.
However, when reading a CSV file, the log contains the actual code used by PROC IMPORT to read the file. You can copy that code from the log to your program window, modify it properly to read the data as you want, and now you've got code that meets your needs.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The problem is that this is used in code scheduled to run automatically on a daily basis that reads in various CSV files created daily. The contents of the CSV files varies, we cannot use the static code as created by proc import in the log. The code really relies on proc import being able to adapt the way reads the files in based on their contents.
The fact that it somehow gets in trouble with such a simple CSV file (i.e. for value "1989" followed by "2/14/2011" it guesses that it can use informat "mmddyy10.") seems to indicate to me that there is likely some simple oversight in how the guessing procedure is programmed, which is why I would classify this as a bug. This does not seem to be a problematic data case for which it would be hard to guess a proper format.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For this particular problem the programmer error was not including the GUESSINGROWS=MAX statement in the PROC IMPORT code.
The larger problem is using a guessing procedure in a scheduled job. PROC IMPORT is designed to be used interactively by a person. Please explain WHY the input to a scheduled job varies? What is the source of the variation? Can you eliminate it? If not how much variation are you talking about? If the fields are always the same but sometimes are not present or some times have their order changed then you can program around that easily.
If you really have no idea what is coming you can write your own code to read the CSV file and decide what they contain.
Here is an example:
https://github.com/sasutils/macros/blob/master/csv2ds.sas
Differences from PROC IMPORT - Supports header lines with more than 32,767 characters - Supports ZIP and GZIP source files - Generates unique variable names by adding numeric suffix - Does not overestimate maxlength when longest value is quoted - Does NOT force character type if all values are quoted - Generates label when generated variable name is different than header - Supports NAMEROW option - Supports numeric fields with special missing values (MISSING statement) - Does not attach unneeded informats or formats - Allows overriding calculated metadata - Allow using random sample of rows to guess metadata - Generates more compact SAS code - Generates analysis summary dataset and raw data view - Saves generated SAS code to a file - Forces DATE and DATETIME formats to show century - Difference in generated V7 compatible variable names - Replaces adjacent non-valid characters with single underscore
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The guessingrows=3 was just for the example CSV file, the one that has only 2 data rows, so was perfectly adequate there I think (there is no "programmer error" there).
The program using proc import checks a folder for csv files that may have been output there by various programs of various other programmers. All it tries to do is read in all the CSV files it finds there, and then output them again to an excel file, each csv file as a separate sheet. So, all that proc import was supposed to do there is create datasets from CSV files.
It seems that indeed in order to avoid these somewhat "silly" issues we may need to create some separate macro in which we read in the CSV files ourselves. That appears then to be the only way to circumvent proc imports guessing behavior.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want to read the CSV file as text there is no need to use PROC IMPORT.
data want;
infile 'myfile.csv' dsd truncover ;
input (var1-var10) (:$200.);
run;
And if you can assume the CSV file is a least well formed enough to have the same number of fields on every line you can automate figuring out how many columns to read.
data _null_;
infile 'myfile.csv' obs=1;
input;
call symputx('nvar',countw(_infile_,',','mq'));
run;
data want;
infile 'myfile.csv' dsd truncover ;
input (var1-var&nvar) (:$200.);
run;
If the lines could be longer than 32767 bytes you can adjust for that also.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes, thanks for the suggestions. Something like that I think is what we may use to replace the various proc imports that are used currently, in order to avoid the occasional guessing mistakes. It will need to be a bit more fancy that this though, probably will need some macro code, because we do need to use the column headers from the first row as variable names. I understand it is not so difficult to make, it would just have been nicer to use a built in SAS procedure rather than custom code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@GerbenRotman wrote:
Yes, thanks for the suggestions. Something like that I think is what we may use to replace the various proc imports that are used currently, in order to avoid the occasional guessing mistakes. It will need to be a bit more fancy that this though, probably will need some macro code, because we do need to use the column headers from the first row as variable names. I understand it is not so difficult to make, it would just have been nicer to use a built in SAS procedure rather than custom code.
Why are you not just telling PROC IMPORT to guess using ALL of the data?
proc import dbms=csv datafile='myfile.csv' out=want replace;
guessingrows=max;
run;
That is actually probably going to work better than the equivalent transformations that Excel would do by default when opening a CSV file.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
No need to use variables names if everytihing is character. Just tell the step that is writing the XLSX file not to write the names.
Here is an example that creates two CSV files and then reads them in and writes them into two sheets in a single XLSX file.
filename csv1 temp;
filename csv2 temp;
filename out 'c:\downloads\no_header.xlsx' ;
proc export data=sashelp.class dbms=csv file=csv1 replace;
run;
proc export data=sashelp.cars dbms=csv file=csv2 replace;
run;
data want;
infile csv1 dsd truncover;
input (var1-var5) (:$100.);
run;
proc export data=want file=out dbms=xlsx replace;
sheet='class';
putnames=no;
run;
data want;
infile csv2 dsd truncover;
input (var1-var15) (:$100.);
run;
proc export data=want file=out dbms=xlsx replace;
sheet='cars';
putnames=no;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
libname outrep xlsx "<path>/outputfile.xlsx";
and then the datasets made by proc import are set to a sheet in this library.
Anyways, I understand we could do it differently, avoiding the use of proc import. The main issue I was hoping that could be solved is that proc import sometimes makes such obvious wrong guesses. But I guess the solution is, just do not use proc import for that. I guess I placed too much trust in proc import.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Use @PaigeMiller's advice to create a basic program. Read those not-well-formed "date" values as a character value with a $10. or longer informat. Then parse the character value to create the desired "date" type value you want. Which you haven't described.
And for further education try Guessingrows=MAX; Only examining 2 or 3 rows you will encounter many other actual data content problems such as character values getting set to short values based on the content of the first couple rows, missing values because of mismatched types.
Do you want an actual date? You will have to supply a month and day of month for values that only show a year. For example this assumes that you want to assign Jan 1 as the date of year when only a year value is encountered.
/* assumes 4 characters are year only*/ if length chardate=4 then enddate= mdy(1,1,input(chardate,4.)); else if length chardate=10 then enddate= input(chardate, mmddyy10.); format enddate mmddyy10.
BTW, the problem is not the "format" but the INFORMAT reading the value. You want to keep those ideas separately as many formats may be used to display any given value. INFORMATs read text into a given value.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is part of code that reads in many CSV files daily (reads them in and combines them in an excel file). That is why it was practical to use proc import, as long as it does not run into errors like this one. I understand that "1986" is not a complete date, I would have expected that proc import would as well; that it would not "guess" to read the column containing this value using informat "mmddyy10.", and that it would instead choose to use a character format for this column like "$9."
Ideally, in our case, it would be practical to tell proc import to just treat everything as character strings, but I do not think there is an option like that.
(I used guessingrows=2 and guessingrows=3 here just for illustrative purposes to show this specific issue, in the actual program it was set to 1000. Indeed in the initial post where I typed "format" I should have typed "informat", I have edited that post to correct this.)
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Since you seem to be stuck with these .csv files with inconsistent dates, and PROC IMPORT doesn't do what you want, you can either
- program your own SAS code to work these dates into usable forms
- wait for SAS to fix it to your liking (if they fix it at all)
I'd choose #1
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In general proc import worked quite well for this, there just seem to be some peculiar cases, like this one, where the guessing goes wrong. Seemingly here perhaps even due to there only being two rows.
with just these two rows:
1989
22/14/2011
proc report's guess is to use: informat mmddyy10.
if I just add a third row:
1989
22/14/2011
1989
it understands that it should use: informat $9.