How to import a large csv file (having over 5,000,000 rows) that obtains multiple columns that exceed lrecl 32767 into sas? For instance, a column named "consumer_complaint_narrative" exceeds 32767. I've tried consumer_complaint_narrative $100000 but it won't work.
My current code is as follows:
data want;
infile 'D:\file.csv' TRUNCOVER DSD lrecl=32767 firstobs=2;
length
date_received $10 product $32767 sub_product $32767 issue $32767 sub_issue $32767
consumer_complaint_narrative $32767 company_public_response $32767 company $32767
state $30 zip_code $10 tags $32767 consumer_consent_provided_ $32767 submitted_via $32767
date_sent_to_company $32767 timely_response_ $32767 consumer_disputed_ $32767 complaint_id $32767;
input date_received -- complaint_id;
run;
So you are saying that some of the individual values are longer than 32,767 bytes?
One obvious thing is you cannot then use 32,767 as the record length if just one of the values in the record could be larger than that. You should be able change the LRECL= setting to 2 million or 10 million bytes (limit might vary by operating system and perhaps available memory).
The other concern is that values that long are much more likely to contain end of line characters (such as CR and/or LF or worst of all actual CR+LF combinations). Those cause havoc for SAS's ability to parse the file into observations and fields. If the values with embedded end of line strings are quoted you might have some luck first pre-processing the file to replace those. https://github.com/sasutils/macros/blob/master/replace_crlf.sas
So if you just want to truncate the long individual values then modify your data step by just changing the LRECL= setting to something larger than the longest line.
If you actually want to read the long strings into multiple variables then you might be helped by using the COLUMN= option on the INFILE statement to name a variable to show you where on the line SAS will next read. This will let you detect the extra long values and then perhaps be able to read it by pieces.
Example using using 10 character limit instead of 32767 to illustrate the method.
filename csv temp;
options parmcards=csv;
parmcards4;
id,name,str1,str2,number
1,Short,one,two,3
2,Empty,,,4
3,Long,This is longer than 10 characters,short,5
;;;;
data want;
length id 8 name $8 str1_1 - str1_4 $10 str2 $10 number 8;
infile csv firstobs=2 dsd truncover lrecl=100000 column=column ;
input id name @;
start=column;
input str1_1 @ ;
end=column;
array str1 str1_1-str1_4;
index=1;
do start=start to end-1 by 10;
input @start str1[index] @;
index+1;
end;
input @end str2 number;
drop start end index;
run;
proc print;
run;
Obs id name str1_1 str1_2 str1_3 str1_4 str2 number 1 1 Short one two 3 2 2 Empty 4 3 3 Long This is lo nger than 10 charact ers short 5
Sorry I am lost. I've attached a sample data that looks like this.
To make it easy, how could I just read those columns that are not so long and not dropping any rows? When I just read "date_received" column, some rows look alright but some other rows read some other information.
The first thing to do is to LOOK at the file and make sure it looks like a valid CSV file.
860 data _null_; 861 infile csv obs=10 lrecl=1000000; 862 input; 863 list; 864 run; NOTE: The infile CSV is: (system-specific pathname), (system-specific file attributes) RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public respon 101 se,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company 201 response to consumer,Timely response?,Consumer disputed?,Complaint ID 270 2 9/16/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repo 101 rting,Improper use of your report,Reporting company used your report improperly,,,"TRANSUNION INTERM 201 EDIATE HOLDINGS, INC.",PA,19040,,,Web,9/16/2022,In progress,Yes,N/A,5990357 275 3 9/9/2022,Credit card or prepaid card,General-purpose credit card or charge card,Problem with a purch 101 ase shown on your statement,Card was charged for something you did not purchase with the card,,,"BAN 201 K OF AMERICA, NATIONAL ASSOCIATION",AZ,85224,,,Web,9/9/2022,In progress,Yes,N/A,5966073 287 4 9/3/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor 101 ting,Incorrect information on your report,Information belongs to someone else,,Company has responded 201 to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solu 301 tions Inc.,AL,35473,,Consent not provided,Web,9/3/2022,Closed with non-monetary relief,Yes,N/A,59434 401 33 402 5 3/6/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor 101 ting,Incorrect information on your report,Information is missing that should be on the report,"Infor 201 mation is missing from my credit profile, that being a XXXX card in my name with the following accou 301 nt # XXXX XXXX XXXX XXXX. 326 6 XXXX XXXX added me as an authorized user on his XXXX card back in XX/XX/2021. XXXX did approve and a 101 dd me on as an authorized user in XX/XX/2021 providing me with the following account # XXXX XXXX XXX 201 X XXXX. 208 7 I am requesting XXXX and Equifax to verify this information with XXXX Card services. 85 8 I am also asking XXXX and Equifax to verify XXXX authorized users policy which states : *XXXX repor 101 ts the account credit history to the three major credit bureaus as to you and the Authorized User. T 201 his can help build the Authorized User 's credit history if used responsibly. Late payments, delinqu 301 encies or other derogatory activity with your credit card accounts and loans may adversely impact yo 401 urs and the Authorized User 's ability to build credit. 456 9 I am also requesting the missing information be added to my credit profile",,"EQUIFAX, INC.",MI,4833 101 4,Servicemember,Consent provided,Web,3/6/2022,Closed with explanation,Yes,N/A,5290866 185 10 7/1/2022,Mortgage,Conventional home mortgage,Trouble during payment process,,"When going to pay my m 101 ortgage today I was told I could not make a partial payment. My partner and I have been making parti 201 al payments ( half and half ) on this exact mortgage for over two years since their online services 301 were so messy. Now I'm told I have to visit a branch to make a partial payment ... I live in Iowa, t 401 here are no branches here. 427 NOTE: 10 records were read from the infile (system-specific pathname). The minimum record length was 85. The maximum record length was 456. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
There is no issue with long values here. The longest line is less than 500 characters. But there is a problem, lines 6,7,8 and 9 do NOT start with the DATE value. Instead the appear to be part of one or more of the character string variables. So it looks like some of the strings contain END OF LINE markers.
If you are lucky the strings contain only CR or only LF characters and the actual lines end with CR+LF combination (the standard for a Windows/DOS file and also the standard for a CSV file).
So run the code to look a the start of the file again, only this time add the TERMSTR=CRLF option.
867 data _null_; 868 infile csv obs=10 lrecl=1000000 termstr=crlf; 869 input; 870 list; 871 run; NOTE: The infile CSV is: (system-specific pathname), (system-specific file attributes) RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public respon RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 101 se,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company 201 response to consumer,Timely response?,Consumer disputed?,Complaint ID 270 2 9/16/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repo 101 rting,Improper use of your report,Reporting company used your report improperly,,,"TRANSUNION INTERM 201 EDIATE HOLDINGS, INC.",PA,19040,,,Web,9/16/2022,In progress,Yes,N/A,5990357 275 3 9/9/2022,Credit card or prepaid card,General-purpose credit card or charge card,Problem with a purch 101 ase shown on your statement,Card was charged for something you did not purchase with the card,,,"BAN 201 K OF AMERICA, NATIONAL ASSOCIATION",AZ,85224,,,Web,9/9/2022,In progress,Yes,N/A,5966073 287 4 9/3/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor 101 ting,Incorrect information on your report,Information belongs to someone else,,Company has responded 201 to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solu 301 tions Inc.,AL,35473,,Consent not provided,Web,9/3/2022,Closed with non-monetary relief,Yes,N/A,59434 401 33 402 5 3/6/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor 101 ting,Incorrect information on your report,Information is missing that should be on the report,"Infor 201 mation is missing from my credit profile, that being a XXXX card in my name with the following accou 301 nt # XXXX XXXX XXXX XXXX. .XXXX XXXX added me as an authorized user on his XXXX card back in XX/XX/2 ZONE 6722255552555525555255552205555255552666662662672662677667676627767266266725555266762666626625525523 NUMR E40308888088880888808888E0A8888088880144540D501301E01548F29A54053520FE089308888031240213B09E088F88F2 401 021. XXXX did approve and add me on as an authorized user in XX/XX/2021 providing me with the follow 501 ing account # XXXX XXXX XXXX XXXX. .I am requesting XXXX and Equifax to verify this information with ZONE 6662666676722255552555525555255552204266276776776662555526662477666727627676672766726666766766627676 NUMR 9E70133F5E40308888088880888808888E0A901D025155349E70888801E40515961804F06529690489309E6F2D149FE07948 601 XXXX Card services. .I am also asking XXXX and Equifax to verify XXXX authorized users policy whic ZONE 2555524676276776667220426626676267666625555266624776667276276766725555226776676766277677276666727666 NUMR 0888803124035269353E0A901D01C3F013B9E70888801E40515961804F065296908888001548F29A5405352300FC93907893 701 h states : *XXXX reports the account credit history to the three major credit bureaus as to you and 801 the Authorized User. This can help build the Authorized User 's credit history if used responsibly. 901 Late payments, delinquencies or other derogatory activity with your credit card accounts and loans m 1001 ay adversely impact yours and the Authorized User 's ability to build credit. .I am also requesting ZONE 6726676776672667667276777266627662477667676625767227266666772762676662676667220426626676276776776662 NUMR 1901465235C909D013409F52301E4048501548F29A54053520730129C94904F0259C40325494E0A901D01C3F025155349E70 1101 the missing information be added to my credit profile",,"EQUIFAX, INC.",MI,48334,Servicemember,Conse 1201 nt provided,Web,3/6/2022,Closed with explanation,Yes,N/A,5290866 1264 6 7/1/2022,Mortgage,Conventional home mortgage,Trouble during payment process,,"When going to pay my m 101 ortgage today I was told I could not make a partial payment. My partner and I have been making parti 201 al payments ( half and half ) on this exact mortgage for over two years since their online services 301 were so messy. Now I'm told I have to visit a branch to make a partial payment ... I live in Iowa, t 401 here are no branches here. ..Previously, in XXXX of 2021, they accidentally double charged me and re ZONE 6676267626626766666726676220057676677672266255552662333322766726666666766672667666266676662662666276 NUMR 852501250EF0221E385308525E0AA02569F53C9C09E088880F602021C048590133945E41CC904F52C5038127540D501E4025 501 solved it by wiring back in the second payment, incurring a {$10.00} domestic wire fee with my bank. 601 After multiple attempts to be reimbursed for this charge, the bank has continued to do nothing to f 701 ix it. This has been an issue since XXXX.",Company has responded to the consumer and the CFPB and ch 801 ooses not to provide a public response,TRUIST FINANCIAL CORPORATION,IA,50312,,Consent provided,Web,7 901 /1/2022,Closed with explanation,Yes,N/A,5726123 947 7 9/13/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repo 101 rting,Problem with a credit reporting company's investigation into an existing problem,Was not notif 201 ied of investigation status or results,,,"EQUIFAX, INC.",VA,22193,,N/A,Phone,9/17/2022,In progress,Y 301 es,N/A,5978401 314 8 8/19/2022,Debt collection,Credit card debt,Took or threatened to take negative or legal action,Seize 101 d or attempted to seize your property,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30052,,,Web,9/7/2 201 022,In progress,Yes,N/A,5891853 231 9 9/6/2022,Checking or savings account,Checking account,Managing an account,Problem using a debit or A RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 101 TM card,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",CA,91402,,,Web,9/6/2022,In progress,Yes,N/A,595442 201 3 201 10 9/7/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor 101 ting,Incorrect information on your report,Account information incorrect,,,"BANK OF AMERICA, NATIONAL 201 ASSOCIATION",AR,72211,,,Web,9/7/2022,In progress,Yes,N/A,5958483 265 NOTE: 10 records were read from the infile (system-specific pathname). The minimum record length was 201. The maximum record length was 1264. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
So that looks more promising.
You could try seeing if PROC IMPORT can GUESS how to read the file.
filename csv 'c:\downloads\complaints.csv' termstr=crlf lrecl=1000000;
proc import datafile=csv out=want replace dbms=csv;
run;
For these 26 observations you only have one really long variable.
But you will get much better results if you instead write your own data step to read the file. That way you can avoid the mistakes that PROC IMPORT makes. Like defining ZIPCODE as a NUMBER when it is really a STRING. Or attaching $ format and and informats to character variables. Or using BEST as if it was the name of an INFORMAT (and in a place where there is no need for an informat).
Thank you @Tom! That is really helpful!
I tried your code below with the original file (size: 1,909,452 KB), and I get an error message saying
"Unable to sample external file, no data in first 5 records.
ERROR: Import unsuccessful. See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 37.40 seconds
cpu time 37.64 seconds"
filename csv 'c:\downloads\complaints.csv' termstr=crlf lrecl=1000000;
proc import datafile=csv out=want replace dbms=csv;
run;
Is it because the size is too big to read with the code?
It is probably because the file looks nothing like the small file you posted.
Generally that is the error you get when the file is empty or consists of just one line. Since the file is not empty if it is that large then it probably has no CR+LF pairs, unlike the example file you posted.
Try the OTHER code that uses a DATA step and the LIST command to LOOK at the actual content of the file.
There are three options for the TERMSTR= option, CRLF (Windows), LF (Unix), and CR (which only confused version of Excel running on MacOS create now since MacOS many years ago converted to being just another version of Unix). Try each of them and see if you can tell what is happening.
You might also try just reading the file as fixed length (so that the end of line characters are consumed as normal characters) and see what happens. Something like will read the first 2,000 bytes from the file.
data _null_;
infile csv recfm=f lrecl=100 obs=20;
input;
list;
run;
I uploaded your file to SAS On Demand and ran this:
data complaints;
infile "~/complaints.csv" dlm="," dsd firstobs=2 lrecl=1000000 length=l termstr=crLF;
input
date_received :mmddyy10.
product :$1000.
sub_product :$1000.
issue :$1000.
sub_issue :$1000.
consumer_complaint_narrative :$32767.
company_public_response :$32767.
company :$100.
state :$2.
zipcode :$5.
tags :$100.
consumer_consent_provided :$20.
submitted_via :$10.
date_sent_to_company :mmddyy10.
company_response_to_consumer :$32767.
timely_response :$3.
consumer_disputed :$10.
complaint_id :$10.
;
format date_received date_sent_to_company yymmdd10.;
if l > 30000 then put l= _N_=;
run;
which successfully read the data.
If you have any issues with this (adapt the INFILE path), post the log (include some of the messages for lines where the data step complains).
Thank you @Kurt_Bremser
When I ran your code, I encountered with this error message.
NOTE: The infile "~\complaints.csv" is:
Filename=~\complaints.csv,
RECFM=V,LRECL=1000000,
File Size (bytes)=1955278742,
Last Modified=19Sep2022:05:39:44,
Create Time=19Sep2022:05:39:44
NOTE: 0 records were read from the infile
"~\complaints.csv".
NOTE: The data set WORK.COMPLAINTS has 0 observations and 18 variables.
NOTE: DATA statement used (Total process time):
real time 1:01.15
cpu time 4.37 seconds
So you have file that is 1,955,278,742 bytes long and it only contains one LINE.
Perhaps it is NOT a text file at all.
Please LOOK at the file.
The LIST statement is your friend here.
This will show you the first 2,000 bytes of that file. If any of the bytes are not printable characters, such as the CR or LF characters you would normally see at the end of lines, then it will also show the actual HEX codes for every byte in the block of 100 bytes.
data _null_;
infile 'complaints.csv' lrecl=100 recfm=f obs=20 ;
input;
list;
run;
Note that a 2 billion byte file with 5 millions lines would have an average line length of only 400 bytes. So perhaps it is a lot like the 26 line file you posted before. You just need to try and figure out why you cannot see the line breaks. Is the file using some other strange characters between the lines instead?
Note that the most common source of the error you are showing is trying to read a text file that is using just CR as the end of line character.
Example that creates a CSV using CR as end of line characters and then tries to read it with PROC IMPORT without telling it that CR is the end of line characters.
1424 filename csv temp ; 1425 data _null_; 1426 file csv termstr=cr dsd ; 1427 set sashelp.class; 1428 if _n_=1 then put 'Name,Sex,Age,Height,Weight'; 1429 put Name Sex Age Height Weight ; 1430 run; NOTE: The file CSV is: (system-specific pathname), (system-specific file attributes) NOTE: 20 records were written to the file (system-specific pathname). The minimum record length was 17. The maximum record length was 26. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 1431 1432 proc import datafile=csv dbms=csv out=want replace; 1433 run; Unable to sample external file, no data in first 5 records. ERROR: Import unsuccessful. See SAS Log for details. NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE IMPORT used (Total process time): real time 0.03 seconds cpu time 0.03 seconds
Then you used a diifferent file than the one you attached. I tested my code with that file, and it WORKS.
Open your file with Notepad++. It will tell you the file format (UNIX or Windows), which determines the TERMSTR, and will reveal any special characters.
It also enables you to shorten the file (without losing information) so you can attach it here.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.