- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
I am trying to import a csv file with ~20 million observations to sas. However, when it creates a SAS file, it has only 15 million observations. Can anyone help me with solving this problem?
Here is the code that I used to import csv file to sas.
proc import datafile="I:\2016q2.csv"
4
5 out=ret.r2016q2
6
7 dbms=csv
8
9 replace;
10
11 run;
12 /**********************************************************************
13 * PRODUCT: SAS
14 * VERSION: 9.4
15 * CREATOR: External File Interface
16 * DATE: 09JAN20
17 * DESC: Generated SAS Datastep Code
18 * TEMPLATE SOURCE: (None Specified.)
19 ***********************************************************************/
20 data RET.R2016Q2 ;
21 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
22 infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
23 informat COMPANY_NAME $31. ;
24 informat ADDRESS $25. ;
25 informat CITY $13. ;
26 informat STATE $4. ;
27 informat ZIP $12. ;
28 informat COUNTY $16. ;
29 informat PHONE $12. ;
30 informat FAX $14. ;
31 informat WWW $31. ;
32 informat LATITUDE $11. ;
33 informat LONGITUDE $13. ;
34 informat CONTACT_NAME $19. ;
35 informat TITLE $16. ;
36 informat GENDER $3. ;
37 informat EMPLOYEE_COUNT $8. ;
38 informat ANNUAL_SALES $24. ;
39 informat SIC_CODE $8. ;
40 informat INDUSTRY $39. ;
41 format COMPANY_NAME $31. ;
42 format ADDRESS $25. ;
43 format CITY $13. ;
44 format STATE $4. ;
45 format ZIP $12. ;
46 format COUNTY $16. ;
47 format PHONE $12. ;
48 format FAX $14. ;
49 format WWW $31. ;
50 format LATITUDE $11. ;
51 format LONGITUDE $13. ;
52 format CONTACT_NAME $19. ;
53 format TITLE $16. ;
54 format GENDER $3. ;
55 format EMPLOYEE_COUNT $8. ;
56 format ANNUAL_SALES $24. ;
57 format SIC_CODE $8. ;
58 format INDUSTRY $39. ;
59 input
60 COMPANY_NAME $
61 ADDRESS $
62 CITY $
63 STATE $
64 ZIP $
65 COUNTY $
66 PHONE $
67 FAX $
68 WWW $
69 LATITUDE $
70 LONGITUDE $
71 CONTACT_NAME $
72 TITLE $
73 GENDER $
74 EMPLOYEE_COUNT $
75 ANNUAL_SALES $
76 SIC_CODE $
77 INDUSTRY $
78 ;
79 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
80 run;
NOTE: The infile 'I:\2016q2.csv' is:
Filename=I:\2016q2.csv,
RECFM=V,LRECL=32767,
File Size (bytes)=4555329197,
Last Modified=09Jan2020:13:34:31,
Create Time=09Jan2020:13:18:10
NOTE: 15860453 records were read from the infile 'I:\2016q2.csv'.
The minimum record length was 1.
The maximum record length was 333.
NOTE: The data set RET.R2016Q2 has 15860453 observations and 18 variables.
NOTE: DATA statement used (Total process time):
real time 34.31 seconds
cpu time 21.81 seconds
15860453 rows created in RET.R2016Q2 from I:\2016q2.csv.
NOTE: RET.R2016Q2 data set was successfully created.
NOTE: The data set RET.R2016Q2 has 15860453 observations and 18 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 35.68 seconds
cpu time 22.03 seconds
Thank you in advance.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @cphd
It might be a stray hex 1A (DOS end-of-file) character in data. try to submit the following code and see what happens. First step reads one character from each input line, but stops when A1x is encountered.
Second step has the infile option ignoredoseof, so it will read to end of the physical file. If it reads more records than the first, the problem is caused by the A1x.
filename in "I:\2016q2.csv";
data _null_;
infile in lrecl=1;
input;
run;
filename in "I:\2016q2.csv" ignoredoseof;
data _null_;
infile in lrecl=1;
input;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please post the SAS log of your program.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Edited the main post. Please see the log attached.
The result shows that SAS found 15860453 rows from the file, while the actual total number of rows is 21,173,376 (including the header) according to the Delimit program.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I put my trust on SAS here.
It may be that you have non-standard line delimiters in there that delimit counts, but SAS does not recognize (CR instead of CRLF).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for the response. When I checked in the Delimit program (or csv file), I am able to see the data over 21MM data points. Is there anyway I can force SAS to import all data in the csv file or any better way to import all?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@cphd - There is nothing in the log to suggest there are any problems. Does the file size (File Size (bytes)=4555329197) match what you are seeing?
I suggest you copy and paste the DATA step from the SAS log to your program window (you can do a block select of the statements to avoid the line numbers by pressing the ALT key when selecting rows with your mouse). Try re-running this with LRECL= 333 on your INFILE statement as this is the reported maximum record length.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The size of the data is correct. I don't know why it drops some observations. It happens to other files that I am working on as well...
I re-run the code following your suggestion with LRECL=333, but it imported the same number which is smaller than what it is supposed to be...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I can only think that there must be something unusual about these files that the SAS default file-handling is not picking up.
What OS did these files originate from? Maybe you need to experiment with the TERMSTR INFILE option: https://documentation.sas.com/?docsetId=hostwin&docsetTarget=chfnoptfmain.htm&docsetVersion=9.4&loca...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @SASKiwi,
Thank you so much for helping with this. I checked the file again. TERMSTR is set up as CRLF... But it has a text qualifier ". Do you think it could cause the importing problem? Delimiter is , and Text Qualifier is ".
Also, I divided the csv file into small pieces and then imported & appended the files. By doing this, I was able to increase the total number of obs (still in short though). However, WARNING: "Multiple lengths were specified for the variable X by input data set(s). This can cause truncation of data" shows up. Do you think it could cause additional problem?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As soon as you have to read multiple files with the same structure (or even just some variables in common), GET RID OF PROC IMPORT!
Take the data step (created by proc import) from the log, adapt it to your needs (documentation of the file!), and use it on all your files. This prevents all problems caused by the continuing guessing of proc import.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Try option ignoredoseof and termstr=
infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ignoredoseof termstr=crlf ;
infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ignoredoseof termstr=lf ;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As @Ksharp said, try using the IGNOREDOSEOF. To do that with PROC IMPORT you can define an fileref using the FILENAME statement and then pass the fileref instead of the physical filename to the proc.
filename raw "I:\2016q2.csv" ignoredoseof ;
proc import datafile=raw
out=ret.r2016q2
dbms=csv
replace
;
run;
If that helps then you might have a larger issue with gibberish characters in your text file. Perhaps you encoding is wrong?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @cphd
It might be a stray hex 1A (DOS end-of-file) character in data. try to submit the following code and see what happens. First step reads one character from each input line, but stops when A1x is encountered.
Second step has the infile option ignoredoseof, so it will read to end of the physical file. If it reads more records than the first, the problem is caused by the A1x.
filename in "I:\2016q2.csv";
data _null_;
infile in lrecl=1;
input;
run;
filename in "I:\2016q2.csv" ignoredoseof;
data _null_;
infile in lrecl=1;
input;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you so much all for helps. The problem was from the empty rows in A1.
The codes below shared by other experts worked perfect! Thank you all!
filename raw "I:\2016q2.csv" ignoredoseof;
proc import datafile=raw
out=redo.r2016q2
dbms=csv
replace;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Note that have the DOS End of File character in your text files has nothing to do with what might have been or not been in the top left cell of some spreadsheet. But it might indicate some other error in the process that created the file.