BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cphd
Obsidian | Level 7

Hi all, 

 

I am trying to import a csv file with ~20 million observations to sas. However, when it creates a SAS file, it has only 15 million observations. Can anyone help me with solving this problem? 

 

Here is the code that I used to import csv file to sas. 

 

 proc import datafile="I:\2016q2.csv"
4
5    out=ret.r2016q2
6
7    dbms=csv
8
9    replace;
10
11   run;

12    /**********************************************************************
13    *   PRODUCT:   SAS
14    *   VERSION:   9.4
15    *   CREATOR:   External File Interface
16    *   DATE:      09JAN20
17    *   DESC:      Generated SAS Datastep Code
18    *   TEMPLATE SOURCE:  (None Specified.)
19    ***********************************************************************/
20       data RET.R2016Q2    ;
21       %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
22       infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
23          informat COMPANY_NAME $31. ;
24          informat ADDRESS $25. ;
25          informat CITY $13. ;
26          informat STATE $4. ;
27          informat ZIP $12. ;
28          informat COUNTY $16. ;
29          informat PHONE $12. ;
30          informat FAX $14. ;
31          informat WWW $31. ;
32          informat LATITUDE $11. ;
33          informat LONGITUDE $13. ;
34          informat CONTACT_NAME $19. ;
35          informat TITLE $16. ;
36          informat GENDER $3. ;
37          informat EMPLOYEE_COUNT $8. ;
38          informat ANNUAL_SALES $24. ;
39          informat SIC_CODE $8. ;
40          informat INDUSTRY $39. ;
41          format COMPANY_NAME $31. ;
42          format ADDRESS $25. ;
43          format CITY $13. ;
44          format STATE $4. ;
45          format ZIP $12. ;
46          format COUNTY $16. ;
47          format PHONE $12. ;
48          format FAX $14. ;
49          format WWW $31. ;
50          format LATITUDE $11. ;
51          format LONGITUDE $13. ;
52          format CONTACT_NAME $19. ;
53          format TITLE $16. ;
54          format GENDER $3. ;
55          format EMPLOYEE_COUNT $8. ;
56          format ANNUAL_SALES $24. ;
57          format SIC_CODE $8. ;
58          format INDUSTRY $39. ;
59       input
60                   COMPANY_NAME $
61                   ADDRESS $
62                   CITY $
63                   STATE $
64                   ZIP $
65                   COUNTY $
66                   PHONE $
67                   FAX $
68                   WWW $
69                   LATITUDE $
70                   LONGITUDE $
71                   CONTACT_NAME $
72                   TITLE $
73                   GENDER $
74                   EMPLOYEE_COUNT $
75                   ANNUAL_SALES $
76                   SIC_CODE $
77                   INDUSTRY $
78       ;
79       if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
80       run;

NOTE: The infile 'I:\2016q2.csv' is:
      Filename=I:\2016q2.csv,
      RECFM=V,LRECL=32767,
      File Size (bytes)=4555329197,
      Last Modified=09Jan2020:13:34:31,
      Create Time=09Jan2020:13:18:10

NOTE: 15860453 records were read from the infile 'I:\2016q2.csv'.
      The minimum record length was 1.
      The maximum record length was 333.
NOTE: The data set RET.R2016Q2 has 15860453 observations and 18 variables.
NOTE: DATA statement used (Total process time):
      real time           34.31 seconds
      cpu time            21.81 seconds


15860453 rows created in RET.R2016Q2 from I:\2016q2.csv.



NOTE: RET.R2016Q2 data set was successfully created.
NOTE: The data set RET.R2016Q2 has 15860453 observations and 18 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           35.68 seconds
      cpu time            22.03 seconds

 

Thank you in advance.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @cphd 

 

It might be a stray hex 1A (DOS end-of-file) character in data.  try to submit the following code and see what happens. First step reads one character from each input line, but stops when A1x is encountered.

 

Second step has the infile option ignoredoseof, so it will read to end of the physical file. If it reads more records than the first, the problem is caused by the A1x.

 


filename in "I:\2016q2.csv";
data _null_;
	infile in lrecl=1;
	input;
run;

filename in "I:\2016q2.csv" ignoredoseof;
data _null_;
	infile in lrecl=1;
	input;
run;

View solution in original post

14 REPLIES 14
SASKiwi
PROC Star

Please post the SAS log of your program.

cphd
Obsidian | Level 7

Edited the main post. Please see the log attached. 

 

The result shows that SAS found 15860453 rows from the file, while the actual total number of rows is 21,173,376 (including the header) according to the Delimit program. 

 

 

 

cphd
Obsidian | Level 7

Thank you for the response. When I checked in the Delimit program (or csv file), I am able to see the data over 21MM data points. Is there anyway I can force SAS to import all data in the csv file or any better way to import all? 

SASKiwi
PROC Star

@cphd  - There is nothing in the log to suggest there are any problems. Does the file size (File Size (bytes)=4555329197) match what you are seeing? 

 

I suggest you copy and paste the DATA step from the SAS log to your program window (you can do a block select of the statements to avoid the line numbers by pressing the ALT key when selecting rows with your mouse). Try re-running this with LRECL= 333 on your INFILE statement as this is the reported maximum record length. 

 

 

cphd
Obsidian | Level 7

The size of the data is correct. I don't know why it drops some observations. It happens to other files that I am working on as well...

 

I re-run the code following your suggestion with LRECL=333, but it imported the same number which is smaller than what it is supposed to be... 

 

 

SASKiwi
PROC Star

I can only think that there must be something unusual about these files that the SAS default file-handling is not picking up.

 

What OS did these files originate from? Maybe you need to experiment with the TERMSTR INFILE option: https://documentation.sas.com/?docsetId=hostwin&docsetTarget=chfnoptfmain.htm&docsetVersion=9.4&loca...

cphd
Obsidian | Level 7

Hi @SASKiwi,

 

Thank you so much for helping with this. I checked the file again. TERMSTR is set up as CRLF... But it has a text qualifier ". Do you think it could cause the importing problem? Delimiter is , and Text Qualifier is ".

 

Also, I divided the csv file into small pieces and then imported & appended the files.  By doing this, I was able to increase the total number of obs (still in short though). However, WARNING: "Multiple lengths were specified for the variable X by input data set(s). This can cause truncation of data" shows up. Do you think it could cause additional problem? 

Kurt_Bremser
Super User

As soon as you have to read multiple files with the same structure (or even just some variables in common), GET RID OF PROC IMPORT!

Take the data step (created by proc import) from the log, adapt it to your needs (documentation of the file!), and use it on all your files. This prevents all problems caused by the continuing guessing of proc import.

Ksharp
Super User

Try option ignoredoseof and termstr=

 

infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ignoredoseof termstr=crlf ;
infile 'I:\2016q2.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ignoredoseof termstr=lf ;
Tom
Super User Tom
Super User

As @Ksharp said, try using the IGNOREDOSEOF.  To do that with PROC IMPORT you can define an fileref using the FILENAME statement and then pass the fileref instead of the physical filename to the proc.

filename raw "I:\2016q2.csv" ignoredoseof ;
proc import datafile=raw
  out=ret.r2016q2
  dbms=csv
  replace
;
run;

If that helps then you might have a larger issue with gibberish characters in your text file.  Perhaps you encoding is wrong?

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @cphd 

 

It might be a stray hex 1A (DOS end-of-file) character in data.  try to submit the following code and see what happens. First step reads one character from each input line, but stops when A1x is encountered.

 

Second step has the infile option ignoredoseof, so it will read to end of the physical file. If it reads more records than the first, the problem is caused by the A1x.

 


filename in "I:\2016q2.csv";
data _null_;
	infile in lrecl=1;
	input;
run;

filename in "I:\2016q2.csv" ignoredoseof;
data _null_;
	infile in lrecl=1;
	input;
run;

cphd
Obsidian | Level 7

Thank you so much all for helps. The problem was from the empty rows in A1.

 

The codes below shared by other experts worked perfect! Thank you all! 

 

filename raw "I:\2016q2.csv" ignoredoseof;
proc import datafile=raw
out=redo.r2016q2
dbms=csv
replace;
run;
Tom
Super User Tom
Super User

Note that have the DOS End of File character in your text files has nothing to do with what might have been or not been in the top left cell of some spreadsheet.  But it might indicate some other error in the process that created the file.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 5770 views
  • 12 likes
  • 6 in conversation