Solved: Importing a huge csv file to SAS without dropping any observation

cphd · Posted 01-09-2020 05:19 PM

Hi all,

I am trying to import a csv file with ~20 million observations to sas. However, when it creates a SAS file, it has only 15 million observations. Can anyone help me with solving this problem?

Here is the code that I used to import csv file to sas.

 proc import datafile="I:\2016q2.csv"
4
5    out=ret.r2016q2
6
7    dbms=csv
8
9    replace;
10
11   run;

12    /**********************************************************************
13    *   PRODUCT:   SAS
14    *   VERSION:   9.4
15    *   CREATOR:   External File Interface
16    *   DATE:      09JAN20
17    *   DESC:      Generated SAS Datastep Code
18    *   TEMPLATE SOURCE:  (None Specified.)
19    ***********************************************************************/
20       data RET.R2016Q2    ;
21       %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
22       infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
23          informat COMPANY_NAME $31. ;
24          informat ADDRESS $25. ;
25          informat CITY $13. ;
26          informat STATE $4. ;
27          informat ZIP $12. ;
28          informat COUNTY $16. ;
29          informat PHONE $12. ;
30          informat FAX $14. ;
31          informat WWW $31. ;
32          informat LATITUDE $11. ;
33          informat LONGITUDE $13. ;
34          informat CONTACT_NAME $19. ;
35          informat TITLE $16. ;
36          informat GENDER $3. ;
37          informat EMPLOYEE_COUNT $8. ;
38          informat ANNUAL_SALES $24. ;
39          informat SIC_CODE $8. ;
40          informat INDUSTRY $39. ;
41          format COMPANY_NAME $31. ;
42          format ADDRESS $25. ;
43          format CITY $13. ;
44          format STATE $4. ;
45          format ZIP $12. ;
46          format COUNTY $16. ;
47          format PHONE $12. ;
48          format FAX $14. ;
49          format WWW $31. ;
50          format LATITUDE $11. ;
51          format LONGITUDE $13. ;
52          format CONTACT_NAME $19. ;
53          format TITLE $16. ;
54          format GENDER $3. ;
55          format EMPLOYEE_COUNT $8. ;
56          format ANNUAL_SALES $24. ;
57          format SIC_CODE $8. ;
58          format INDUSTRY $39. ;
59       input
60                   COMPANY_NAME $
61                   ADDRESS $
62                   CITY $
63                   STATE $
64                   ZIP $
65                   COUNTY $
66                   PHONE $
67                   FAX $
68                   WWW $
69                   LATITUDE $
70                   LONGITUDE $
71                   CONTACT_NAME $
72                   TITLE $
73                   GENDER $
74                   EMPLOYEE_COUNT $
75                   ANNUAL_SALES $
76                   SIC_CODE $
77                   INDUSTRY $
78       ;
79       if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
80       run;

NOTE: The infile 'I:\2016q2.csv' is:
      Filename=I:\2016q2.csv,
      RECFM=V,LRECL=32767,
      File Size (bytes)=4555329197,
      Last Modified=09Jan2020:13:34:31,
      Create Time=09Jan2020:13:18:10

NOTE: 15860453 records were read from the infile 'I:\2016q2.csv'.
      The minimum record length was 1.
      The maximum record length was 333.
NOTE: The data set RET.R2016Q2 has 15860453 observations and 18 variables.
NOTE: DATA statement used (Total process time):
      real time           34.31 seconds
      cpu time            21.81 seconds


15860453 rows created in RET.R2016Q2 from I:\2016q2.csv.



NOTE: RET.R2016Q2 data set was successfully created.
NOTE: The data set RET.R2016Q2 has 15860453 observations and 18 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           35.68 seconds
      cpu time            22.03 seconds

Thank you in advance.

ErikLund_Jensen · Posted 01-10-2020 12:31 PM

Hi @cphd

It might be a stray hex 1A (DOS end-of-file) character in data. try to submit the following code and see what happens. First step reads one character from each input line, but stops when A1x is encountered.

Second step has the infile option ignoredoseof, so it will read to end of the physical file. If it reads more records than the first, the problem is caused by the A1x.


filename in "I:\2016q2.csv";
data _null_;
	infile in lrecl=1;
	input;
run;

filename in "I:\2016q2.csv" ignoredoseof;
data _null_;
	infile in lrecl=1;
	input;
run;

View solution in original post

SASKiwi · Posted 01-09-2020 05:21 PM

Please post the SAS log of your program.

cphd · Posted 01-09-2020 06:08 PM

Edited the main post. Please see the log attached.

The result shows that SAS found 15860453 rows from the file, while the actual total number of rows is 21,173,376 (including the header) according to the Delimit program.

Kurt_Bremser · Posted 01-09-2020 06:30 PM

I put my trust on SAS here.

It may be that you have non-standard line delimiters in there that delimit counts, but SAS does not recognize (CR instead of CRLF).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

cphd · Posted 01-09-2020 06:40 PM

Thank you for the response. When I checked in the Delimit program (or csv file), I am able to see the data over 21MM data points. Is there anyway I can force SAS to import all data in the csv file or any better way to import all?

SASKiwi · Posted 01-09-2020 06:33 PM

@cphd - There is nothing in the log to suggest there are any problems. Does the file size (File Size (bytes)=4555329197) match what you are seeing?

I suggest you copy and paste the DATA step from the SAS log to your program window (you can do a block select of the statements to avoid the line numbers by pressing the ALT key when selecting rows with your mouse). Try re-running this with LRECL= 333 on your INFILE statement as this is the reported maximum record length.

cphd · Posted 01-09-2020 06:51 PM

The size of the data is correct. I don't know why it drops some observations. It happens to other files that I am working on as well...

I re-run the code following your suggestion with LRECL=333, but it imported the same number which is smaller than what it is supposed to be...

SASKiwi · Posted 01-09-2020 07:10 PM

I can only think that there must be something unusual about these files that the SAS default file-handling is not picking up.

What OS did these files originate from? Maybe you need to experiment with the TERMSTR INFILE option: https://documentation.sas.com/?docsetId=hostwin&docsetTarget=chfnoptfmain.htm&docsetVersion=9.4&loca...

cphd · Posted 01-10-2020 05:40 AM

Hi @SASKiwi,

Thank you so much for helping with this. I checked the file again. TERMSTR is set up as CRLF... But it has a text qualifier ". Do you think it could cause the importing problem? Delimiter is , and Text Qualifier is ".

Also, I divided the csv file into small pieces and then imported & appended the files. By doing this, I was able to increase the total number of obs (still in short though). However, WARNING: "Multiple lengths were specified for the variable X by input data set(s). This can cause truncation of data" shows up. Do you think it could cause additional problem?

Kurt_Bremser · Posted 01-10-2020 07:18 AM

As soon as you have to read multiple files with the same structure (or even just some variables in common), GET RID OF PROC IMPORT!

Take the data step (created by proc import) from the log, adapt it to your needs (documentation of the file!), and use it on all your files. This prevents all problems caused by the continuing guessing of proc import.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 01-10-2020 06:22 AM

Try option ignoredoseof and termstr=

infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ignoredoseof termstr=crlf ;

infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ignoredoseof termstr=lf ;

Tom · Posted 01-10-2020 11:13 AM

As @Ksharp said, try using the IGNOREDOSEOF. To do that with PROC IMPORT you can define an fileref using the FILENAME statement and then pass the fileref instead of the physical filename to the proc.

filename raw "I:\2016q2.csv" ignoredoseof ;
proc import datafile=raw
  out=ret.r2016q2
  dbms=csv
  replace
;
run;

If that helps then you might have a larger issue with gibberish characters in your text file. Perhaps you encoding is wrong?

ErikLund_Jensen · Posted 01-10-2020 12:31 PM

Hi @cphd

It might be a stray hex 1A (DOS end-of-file) character in data. try to submit the following code and see what happens. First step reads one character from each input line, but stops when A1x is encountered.

Second step has the infile option ignoredoseof, so it will read to end of the physical file. If it reads more records than the first, the problem is caused by the A1x.


filename in "I:\2016q2.csv";
data _null_;
	infile in lrecl=1;
	input;
run;

filename in "I:\2016q2.csv" ignoredoseof;
data _null_;
	infile in lrecl=1;
	input;
run;

cphd · Posted 01-10-2020 05:06 PM

Thank you so much all for helps. The problem was from the empty rows in A1.

The codes below shared by other experts worked perfect! Thank you all!

filename raw "I:\2016q2.csv" ignoredoseof;
proc import datafile=raw
out=redo.r2016q2
dbms=csv
replace;
run;

Tom · Posted 01-10-2020 09:04 PM

Note that have the DOS End of File character in your text files has nothing to do with what might have been or not been in the top left cell of some spreadsheet. But it might indicate some other error in the process that created the file.

Importing a huge csv file to SAS without dropping any observation

Re: Importing a huge csv file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge cvs file to SAS without dropping any observation

Re: Importing a huge csv file to SAS without dropping any observation

Re: Importing a huge csv file to SAS without dropping any observation

Re: Importing a huge csv file to SAS without dropping any observation

Re: Importing a huge csv file to SAS without dropping any observation

Re: Importing a huge csv file to SAS without dropping any observation

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away