SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
GerbenRotman
Calcite | Level 5

We encounter the following problem when trying to read in a simple CSV file which on the first row has an incomplete date (only the year) and on the second row a complete date. Here is a simplified version of the CSV file that can be used to demonstrate the issue:

text,start,end
abc,1968,1989
def,1981,2/14/2011

When we try to read this in with proc import using guessingrows=3, or higher, like this:

proc import out=test
datafile='<path>/test.csv'
dbms=csv replace;
guessingrows=3;
getnames=yes;
run;

then we get an error for the first data row because proc import chooses to use: "informat end mmddyy10.;", and "1989" does not follow that format.

If we use guessingrows=2; then it correctly notices that the first data row is not a date and it uses "informat end best12.;" (which then of course gives, as expected, an error for the second data row).

What should have happened I would think, is that with "guessingrows" set to 3 or higher it should have seen that it needs to use neither a numeric nor a date format but a character one like "informat end $9.;". It appears that this is a bug in proc import.




1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

UPDATED:

The text you posted that I copied had an string that could not be a date, that is why the code below worked.

 

You should report this to SAS as a BUG.

https://support.sas.com/en/technical-support/submit-a-support-request.html 

 

 

Spoiler

 

That seems wrong.  I cannot re-create the issue you are seeing. 

What version of SAS are you using?

Is there something else going on with that file?

 

This works on SAS 9.4M5 and the 9.4m6 version available in SAS ODA.

filename csv temp;
data _null_;
  file csv;
  put 'date'/'1989'/'22/14/2011';
run;
proc import dbms=csv file=csv out=want replace;
run;

proc print;
run;
36   /**********************************************************************
637   *   PRODUCT:   SAS
638   *   VERSION:   9.4
639   *   CREATOR:   External File Interface
640   *   DATE:      17AUG22
641   *   DESC:      Generated SAS Datastep Code
642   *   TEMPLATE SOURCE:  (None Specified.)
643   ***********************************************************************/
644      data WORK.WANT    ;
645      %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
646      infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
647         informat date $10. ;
648         format date $10. ;
649      input
650                  date  $
651      ;
652      if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
653      run;

Tom_0-1660757756002.png

 

 

View solution in original post

17 REPLIES 17
PaigeMiller
Diamond | Level 26

PROC IMPORT is a guessing procedure ... if makes "reasonable" guesses as to how to read the file ("reasonable" to the people who programmmed it). It can not correctly handle EVERY situation that might ever come up. So is this a bug? I don't know, I guess that's a matter of opinon.

 

However, when reading a CSV file, the log contains the actual code used by PROC IMPORT to read the file. You can copy that code from the log to your program window, modify it properly to read the data as you want, and now you've got code that meets your needs.

--
Paige Miller
GerbenRotman
Calcite | Level 5

The problem is that this is used in code scheduled to run automatically on a daily basis that reads in various CSV files created daily. The contents of the CSV files varies, we cannot use the static code as created by proc import in the log. The code really relies on proc import being able to adapt the way reads the files in based on their contents.

The fact that it somehow gets in trouble with such a simple CSV file (i.e. for value "1989" followed by "2/14/2011" it guesses that it can use informat "mmddyy10.") seems to indicate to me that there is likely some simple oversight in how the guessing procedure is programmed, which is why I would classify this as a bug. This does not seem to be a problematic data case for which it would be hard to guess a proper format.

Tom
Super User Tom
Super User

For this particular problem the programmer error was not including the GUESSINGROWS=MAX statement in the PROC IMPORT code. 

 

The larger problem is using a guessing procedure in a scheduled job.  PROC IMPORT is designed to be used interactively by a person. Please explain WHY the input to a scheduled job varies?  What is the source of the variation? Can you eliminate it?  If not how much variation are you talking about?  If the fields are always the same but sometimes are not present or some times have their order changed then you can program around that easily.

 

If you really have no idea what is coming you can write your own code to read the CSV file and decide what they contain.

Here is an example:

https://github.com/sasutils/macros/blob/master/csv2ds.sas

 

Differences from PROC IMPORT
- Supports header lines with more than 32,767 characters
- Supports ZIP and GZIP source files
- Generates unique variable names by adding numeric suffix
- Does not overestimate maxlength when longest value is quoted
- Does NOT force character type if all values are quoted
- Generates label when generated variable name is different than header
- Supports NAMEROW option
- Supports numeric fields with special missing values (MISSING statement)
- Does not attach unneeded informats or formats
- Allows overriding calculated metadata
- Allow using random sample of rows to guess metadata
- Generates more compact SAS code
- Generates analysis summary dataset and raw data view
- Saves generated SAS code to a file
- Forces DATE and DATETIME formats to show century
- Difference in generated V7 compatible variable names
  - Replaces adjacent non-valid characters with single underscore
GerbenRotman
Calcite | Level 5

The guessingrows=3 was just for the example CSV file, the one that has only 2 data rows, so was perfectly adequate there I think (there is no "programmer error" there).

The program using proc import checks a folder for csv files that may have been output there by various programs of various other programmers. All it tries to do is read in all the CSV files it finds there, and then output them again to an excel file, each csv file as a separate sheet. So, all that proc import was supposed to do there is create datasets from CSV files.

It seems that indeed in order to avoid these somewhat "silly" issues we may need to create some separate macro in which we read in the CSV files ourselves. That appears then to be the only way to circumvent proc imports guessing behavior.

Tom
Super User Tom
Super User

If you want to read the CSV file as text there is no need to use PROC IMPORT.

data want;
  infile 'myfile.csv' dsd truncover ;
  input (var1-var10) (:$200.);
run;

And if you can assume the CSV file is a least well formed enough to have the same number of fields on every line you can automate figuring out how many columns to read.

data _null_;
  infile 'myfile.csv' obs=1;
  input;
  call symputx('nvar',countw(_infile_,',','mq'));
run;
data want;
  infile 'myfile.csv' dsd truncover ;
  input (var1-var&nvar) (:$200.);
run;

If the lines could be longer than 32767 bytes you can adjust for that also.

GerbenRotman
Calcite | Level 5

Yes, thanks for the suggestions. Something like that I think is what we may use to replace the various proc imports that are used currently, in order to avoid the occasional guessing mistakes. It will need to be a bit more fancy that this though, probably will need some macro code, because we do need to use the column headers from the first row as variable names. I understand it is not so difficult to make, it would just have been nicer to use a built in SAS procedure rather than custom code.

Tom
Super User Tom
Super User

@GerbenRotman wrote:

Yes, thanks for the suggestions. Something like that I think is what we may use to replace the various proc imports that are used currently, in order to avoid the occasional guessing mistakes. It will need to be a bit more fancy that this though, probably will need some macro code, because we do need to use the column headers from the first row as variable names. I understand it is not so difficult to make, it would just have been nicer to use a built in SAS procedure rather than custom code.


Why are you not just telling PROC IMPORT to guess using ALL of the data? 

proc import dbms=csv datafile='myfile.csv' out=want replace;
  guessingrows=max;
run;

That is actually probably going to work better than the equivalent transformations that Excel would do by default when opening a CSV file.

GerbenRotman
Calcite | Level 5
We are using guessingrows=1000, these csv files are very unlikely to have more than 1000 rows. But, yes, we could put guessingrows=max to be even more save. It does not protect against the issue at hand though (where all rows, all two of them, are used to make the guess, but the guess is still wrong).
Tom
Super User Tom
Super User

No need to use variables names if everytihing is character.  Just tell the step that is writing the XLSX file not to write the names.

 

Here is an example that creates two CSV files and then reads them in and writes them into two sheets in a single XLSX file.

filename csv1 temp;
filename csv2 temp;
filename out 'c:\downloads\no_header.xlsx' ;

proc export data=sashelp.class dbms=csv file=csv1 replace;
run;
proc export data=sashelp.cars dbms=csv file=csv2 replace;
run;

data want;
  infile csv1 dsd truncover;
  input (var1-var5) (:$100.);
run;

proc export data=want file=out dbms=xlsx replace;
  sheet='class';
  putnames=no;
run;

data want;
  infile csv2 dsd truncover;
  input (var1-var15) (:$100.);
run;

proc export data=want file=out dbms=xlsx replace;
  sheet='cars';
  putnames=no;
run;
GerbenRotman
Calcite | Level 5
Ok yes, that would work but would require us to update the way we output to Excel (to use proc export). Currently we output the datasets to Excel using "libname xlsx", like so:
libname outrep xlsx "<path>/outputfile.xlsx";
and then the datasets made by proc import are set to a sheet in this library.

Anyways, I understand we could do it differently, avoiding the use of proc import. The main issue I was hoping that could be solved is that proc import sometimes makes such obvious wrong guesses. But I guess the solution is, just do not use proc import for that. I guess I placed too much trust in proc import.
ballardw
Super User

Use @PaigeMiller's advice to create a basic program. Read those not-well-formed "date" values as a character value with a $10. or longer informat. Then parse the character value to create the desired "date" type value you want. Which you haven't described.

And for further education try Guessingrows=MAX; Only examining 2 or 3 rows you will encounter many other actual data content problems such as character values getting set to short values based on the content of the first couple rows, missing values because of mismatched types.

 

Do you want an actual date? You will have to supply a month and day of month for values that only show a year. For example this assumes that you want to assign Jan 1 as the date of year when only a year value is encountered.

 

/* assumes 4 characters are year only*/
if length chardate=4 then enddate= mdy(1,1,input(chardate,4.));
else if length chardate=10 then enddate= input(chardate, mmddyy10.);
format enddate mmddyy10.

BTW, the problem is not the "format" but the INFORMAT reading the value. You want to keep those ideas separately as many formats may be used to display any given value. INFORMATs read text into a given value.

GerbenRotman
Calcite | Level 5

This is part of code that reads in many CSV files daily (reads them in and combines them in an excel file). That is why it was practical to use proc import, as long as it does not run into errors like this one. I understand that "1986" is not a complete date, I would have expected that proc import would as well; that it would not "guess" to read the column containing this value using informat "mmddyy10.", and that it would instead choose to use a character format for this column like "$9."

 

Ideally, in our case, it would be practical to tell proc import to just treat everything as character strings, but I do not think there is an option like that.


(I used guessingrows=2 and guessingrows=3 here just for illustrative purposes to show this specific issue, in the actual program it was set to 1000. Indeed in the initial post where I typed "format" I should have typed "informat", I have edited that post to correct this.)

PaigeMiller
Diamond | Level 26

Since you seem to be stuck with these .csv files with inconsistent dates, and PROC IMPORT doesn't do what you want, you can either

 

  1. program your own SAS code to work these dates into usable forms
  2. wait for SAS to fix it to your liking (if they fix it at all)

I'd choose #1

--
Paige Miller
GerbenRotman
Calcite | Level 5
Yes, thanks. We may go for option 1, indeed.
In general proc import worked quite well for this, there just seem to be some peculiar cases, like this one, where the guessing goes wrong. Seemingly here perhaps even due to there only being two rows.

with just these two rows:
1989
22/14/2011
proc report's guess is to use: informat mmddyy10.

if I just add a third row:
1989
22/14/2011
1989
it understands that it should use: informat $9.


sas-innovate-white.png

Join us for our biggest event of the year!

Four days of inspiring keynotes, product reveals, hands-on learning opportunities, deep-dive demos, and peer-led breakouts. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 3506 views
  • 1 like
  • 4 in conversation