BookmarkSubscribeRSS Feed
latte747
Fluorite | Level 6

How to import a large csv file (having over 5,000,000 rows) that obtains multiple columns that exceed lrecl 32767 into sas? For instance, a column named "consumer_complaint_narrative" exceeds 32767. I've tried consumer_complaint_narrative $100000 but it won't work.

 

My current code is as follows:

 

data want;
infile 'D:\file.csv' TRUNCOVER DSD lrecl=32767 firstobs=2;
length
date_received $10 product $32767 sub_product $32767 issue $32767 sub_issue $32767
consumer_complaint_narrative $32767 company_public_response $32767 company $32767
state $30 zip_code $10 tags $32767 consumer_consent_provided_ $32767 submitted_via $32767
date_sent_to_company $32767 timely_response_ $32767 consumer_disputed_ $32767 complaint_id $32767;
input date_received -- complaint_id;
run; 

 

 

9 REPLIES 9
Tom
Super User Tom
Super User

So you are saying that some of the individual values are longer than 32,767 bytes?

 

One obvious thing is you cannot then use 32,767 as the record length if just one of the values in the record could be larger than that.  You should be able change the LRECL= setting to 2 million or 10 million bytes (limit might vary by operating system and perhaps available memory).

 

The other concern is that values that long are much more likely to contain end of line characters (such as CR and/or LF or worst of all actual CR+LF combinations).  Those cause havoc for SAS's ability to parse the file into observations and fields.  If the values with embedded end of line strings are quoted you might have some luck first pre-processing the file to replace those.  https://github.com/sasutils/macros/blob/master/replace_crlf.sas

 

So if you just want to truncate the long individual values then modify your data step by just changing the LRECL= setting to something larger than the longest line.

 

If you actually want to read the long strings into multiple variables then you might be helped by using the COLUMN= option on the INFILE statement to name a variable to show you where on the line SAS will next read.  This will let you detect the extra long values and then perhaps be able to read it by pieces.

 

Example using using 10 character limit instead of 32767 to illustrate the method.

 

filename csv temp;
options parmcards=csv;
parmcards4;
id,name,str1,str2,number
1,Short,one,two,3
2,Empty,,,4
3,Long,This is longer than 10 characters,short,5
;;;;

data want;
  length id 8 name $8 str1_1 - str1_4 $10 str2 $10 number 8;
  infile csv firstobs=2 dsd truncover lrecl=100000 column=column ;
  input id name @;
  start=column;
  input str1_1 @ ;
  end=column;
  array str1 str1_1-str1_4;
  index=1;
  do start=start to end-1 by 10;
    input @start str1[index] @;
    index+1;
  end;
  input @end str2 number;
  drop start end index;
run;

proc print;
run;
Obs    id    name     str1_1         str1_2        str1_3      str1_4    str2     number

 1      1    Short    one                                                two         3
 2      2    Empty                                                                   4
 3      3    Long     This is lo    nger than    10 charact     ers      short       5

 

latte747
Fluorite | Level 6

Sorry I am lost. I've attached a sample data that looks like this. 

 

To make it easy, how could I just read those columns that are not so long and not dropping any rows?  When I just read "date_received" column, some rows look alright but some other rows read some other information.

 

 

Tom
Super User Tom
Super User

The first thing to do is to LOOK at the file and make sure it looks like a valid CSV file.

860  data _null_;
861   infile csv obs=10 lrecl=1000000;
862   input;
863   list;
864  run;

NOTE: The infile CSV is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public respon
     101  se,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company
     201   response to consumer,Timely response?,Consumer disputed?,Complaint ID 270
2         9/16/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repo
     101  rting,Improper use of your report,Reporting company used your report improperly,,,"TRANSUNION INTERM
     201  EDIATE HOLDINGS, INC.",PA,19040,,,Web,9/16/2022,In progress,Yes,N/A,5990357 275
3         9/9/2022,Credit card or prepaid card,General-purpose credit card or charge card,Problem with a purch
     101  ase shown on your statement,Card was charged for something you did not purchase with the card,,,"BAN
     201  K OF AMERICA, NATIONAL ASSOCIATION",AZ,85224,,,Web,9/9/2022,In progress,Yes,N/A,5966073 287
4         9/3/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor
     101  ting,Incorrect information on your report,Information belongs to someone else,,Company has responded
     201   to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solu
     301  tions Inc.,AL,35473,,Consent not provided,Web,9/3/2022,Closed with non-monetary relief,Yes,N/A,59434
     401  33 402
5         3/6/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor
     101  ting,Incorrect information on your report,Information is missing that should be on the report,"Infor
     201  mation is missing from my credit profile, that being a XXXX card in my name with the following accou
     301  nt # XXXX XXXX XXXX XXXX.  326
6         XXXX XXXX added me as an authorized user on his XXXX card back in XX/XX/2021. XXXX did approve and a
     101  dd me on as an authorized user in XX/XX/2021 providing me with the following account # XXXX XXXX XXX
     201  X XXXX.  208
7         I am requesting XXXX and Equifax to verify this information with XXXX Card services.  85
8         I am also asking XXXX and Equifax to verify XXXX  authorized users policy which states : *XXXX repor
     101  ts the account credit history to the three major credit bureaus as to you and the Authorized User. T
     201  his can help build the Authorized User 's credit history if used responsibly. Late payments, delinqu
     301  encies or other derogatory activity with your credit card accounts and loans may adversely impact yo
     401  urs and the Authorized User 's ability to build credit.  456
9         I am also requesting the missing information be added to my credit profile",,"EQUIFAX, INC.",MI,4833
     101  4,Servicemember,Consent provided,Web,3/6/2022,Closed with explanation,Yes,N/A,5290866 185
10        7/1/2022,Mortgage,Conventional home mortgage,Trouble during payment process,,"When going to pay my m
     101  ortgage today I was told I could not make a partial payment. My partner and I have been making parti
     201  al payments ( half and half ) on this exact mortgage for over two years since their online services
     301  were so messy. Now I'm told I have to visit a branch to make a partial payment ... I live in Iowa, t
     401  here are no branches here.  427
NOTE: 10 records were read from the infile (system-specific pathname).
      The minimum record length was 85.
      The maximum record length was 456.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

There is no issue with long values here. The longest line is less than 500 characters.  But there is a problem,  lines 6,7,8 and 9 do NOT start with the DATE value. Instead the appear to be part of one or more of the character string variables. So it looks like some of the strings contain END OF LINE markers.

 

If you are lucky the strings contain only CR or only LF characters and the actual lines end with CR+LF combination (the standard for a Windows/DOS file and also the standard for a CSV file).

 

So run the code to look a the start of the file again, only this time add the TERMSTR=CRLF option.

867  data _null_;
868   infile csv obs=10 lrecl=1000000 termstr=crlf;
869   input;
870   list;
871  run;

NOTE: The infile CSV is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public respon
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
     101  se,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company
     201   response to consumer,Timely response?,Consumer disputed?,Complaint ID 270
2         9/16/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repo
     101  rting,Improper use of your report,Reporting company used your report improperly,,,"TRANSUNION INTERM
     201  EDIATE HOLDINGS, INC.",PA,19040,,,Web,9/16/2022,In progress,Yes,N/A,5990357 275
3         9/9/2022,Credit card or prepaid card,General-purpose credit card or charge card,Problem with a purch
     101  ase shown on your statement,Card was charged for something you did not purchase with the card,,,"BAN
     201  K OF AMERICA, NATIONAL ASSOCIATION",AZ,85224,,,Web,9/9/2022,In progress,Yes,N/A,5966073 287
4         9/3/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor
     101  ting,Incorrect information on your report,Information belongs to someone else,,Company has responded
     201   to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solu
     301  tions Inc.,AL,35473,,Consent not provided,Web,9/3/2022,Closed with non-monetary relief,Yes,N/A,59434
     401  33 402
5         3/6/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor
     101  ting,Incorrect information on your report,Information is missing that should be on the report,"Infor
     201  mation is missing from my credit profile, that being a XXXX card in my name with the following accou

     301  nt # XXXX XXXX XXXX XXXX. .XXXX XXXX added me as an authorized user on his XXXX card back in XX/XX/2
    ZONE  6722255552555525555255552205555255552666662662672662677667676627767266266725555266762666626625525523
    NUMR  E40308888088880888808888E0A8888088880144540D501301E01548F29A54053520FE089308888031240213B09E088F88F2
     401  021. XXXX did approve and add me on as an authorized user in XX/XX/2021 providing me with the follow

     501  ing account # XXXX XXXX XXXX XXXX. .I am requesting XXXX and Equifax to verify this information with
    ZONE  6662666676722255552555525555255552204266276776776662555526662477666727627676672766726666766766627676
    NUMR  9E70133F5E40308888088880888808888E0A901D025155349E70888801E40515961804F06529690489309E6F2D149FE07948

     601   XXXX Card services. .I am also asking XXXX and Equifax to verify XXXX  authorized users policy whic
    ZONE  2555524676276776667220426626676267666625555266624776667276276766725555226776676766277677276666727666
    NUMR  0888803124035269353E0A901D01C3F013B9E70888801E40515961804F065296908888001548F29A5405352300FC93907893
     701  h states : *XXXX reports the account credit history to the three major credit bureaus as to you and
     801  the Authorized User. This can help build the Authorized User 's credit history if used responsibly.
     901  Late payments, delinquencies or other derogatory activity with your credit card accounts and loans m

    1001  ay adversely impact yours and the Authorized User 's ability to build credit. .I am also requesting
    ZONE  6726676776672667667276777266627662477667676625767227266666772762676662676667220426626676276776776662
    NUMR  1901465235C909D013409F52301E4048501548F29A54053520730129C94904F0259C40325494E0A901D01C3F025155349E70
    1101  the missing information be added to my credit profile",,"EQUIFAX, INC.",MI,48334,Servicemember,Conse
    1201  nt provided,Web,3/6/2022,Closed with explanation,Yes,N/A,5290866 1264
6         7/1/2022,Mortgage,Conventional home mortgage,Trouble during payment process,,"When going to pay my m
     101  ortgage today I was told I could not make a partial payment. My partner and I have been making parti
     201  al payments ( half and half ) on this exact mortgage for over two years since their online services
     301  were so messy. Now I'm told I have to visit a branch to make a partial payment ... I live in Iowa, t

     401  here are no branches here. ..Previously, in XXXX of 2021, they accidentally double charged me and re
    ZONE  6676267626626766666726676220057676677672266255552662333322766726666666766672667666266676662662666276
    NUMR  852501250EF0221E385308525E0AA02569F53C9C09E088880F602021C048590133945E41CC904F52C5038127540D501E4025
     501  solved it by wiring back in the second payment, incurring a {$10.00} domestic wire fee with my bank.
     601   After multiple attempts to be reimbursed for this charge, the bank has continued to do nothing to f
     701  ix it. This has been an issue since XXXX.",Company has responded to the consumer and the CFPB and ch
     801  ooses not to provide a public response,TRUIST FINANCIAL CORPORATION,IA,50312,,Consent provided,Web,7
     901  /1/2022,Closed with explanation,Yes,N/A,5726123 947
7         9/13/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repo
     101  rting,Problem with a credit reporting company's investigation into an existing problem,Was not notif
     201  ied of investigation status or results,,,"EQUIFAX, INC.",VA,22193,,N/A,Phone,9/17/2022,In progress,Y
     301  es,N/A,5978401 314
8         8/19/2022,Debt collection,Credit card debt,Took or threatened to take negative or legal action,Seize
     101  d or attempted to seize your property,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30052,,,Web,9/7/2
     201  022,In progress,Yes,N/A,5891853 231
9         9/6/2022,Checking or savings account,Checking account,Managing an account,Problem using a debit or A
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
     101  TM card,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",CA,91402,,,Web,9/6/2022,In progress,Yes,N/A,595442
     201  3 201
10        9/7/2022,"Credit reporting, credit repair services, or other personal consumer reports",Credit repor
     101  ting,Incorrect information on your report,Account information incorrect,,,"BANK OF AMERICA, NATIONAL
     201   ASSOCIATION",AR,72211,,,Web,9/7/2022,In progress,Yes,N/A,5958483 265
NOTE: 10 records were read from the infile (system-specific pathname).
      The minimum record length was 201.
      The maximum record length was 1264.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

So that looks more promising.

 

You could try seeing if PROC IMPORT can GUESS how to read the file.

filename csv 'c:\downloads\complaints.csv' termstr=crlf lrecl=1000000;
proc import datafile=csv out=want replace dbms=csv;
run;

For these 26 observations you only have one really long variable.

Tom_0-1664919729740.png

But you will get much better results if you instead write your own data step to read the file.  That way you can avoid the mistakes that PROC IMPORT makes.  Like defining ZIPCODE as a NUMBER when it is really a STRING.  Or attaching $ format and and informats to character variables. Or using BEST as if it was the name of an INFORMAT (and in a place where there is no need for an informat).

latte747
Fluorite | Level 6

Thank you @Tom! That is really helpful!

I tried your code below with the original file (size: 1,909,452 KB), and I get an error message saying

"Unable to sample external file, no data in first 5 records.
ERROR: Import unsuccessful. See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 37.40 seconds
cpu time 37.64 seconds"

filename csv 'c:\downloads\complaints.csv' termstr=crlf lrecl=1000000;
proc import datafile=csv out=want replace dbms=csv;
run;

 

Is it because the size is too big to read with the code?

Tom
Super User Tom
Super User

It is probably because the file looks nothing like the small file you posted.

 

Generally that is the error you get when the file is empty or consists of just one line.  Since the file is not empty if it is that large then it probably has no CR+LF pairs, unlike the example file you posted.

 

Try the OTHER code that uses a DATA step and the LIST command to LOOK at the actual content of the file.

There are three options for the TERMSTR= option, CRLF (Windows), LF (Unix), and CR (which only confused version of Excel running on MacOS create now since MacOS many years ago converted to being just another  version of Unix).  Try each of them and see if you can tell what is happening.

 

You might also try just reading the file as fixed length (so that the end of line characters are consumed as normal characters) and see what happens.  Something like will read the first 2,000 bytes from the file.

data _null_;
  infile csv recfm=f lrecl=100 obs=20;
  input;
  list;
run;

 

Kurt_Bremser
Super User

I uploaded your file to SAS On Demand and ran this:

data complaints;
infile "~/complaints.csv" dlm="," dsd firstobs=2 lrecl=1000000 length=l termstr=crLF;
input
  date_received :mmddyy10.
  product :$1000.
  sub_product :$1000.
  issue :$1000.
  sub_issue :$1000.
  consumer_complaint_narrative :$32767.
  company_public_response :$32767.
  company :$100.
  state :$2.
  zipcode :$5.
  tags :$100.
  consumer_consent_provided :$20.
  submitted_via :$10.
  date_sent_to_company :mmddyy10.
  company_response_to_consumer :$32767.
  timely_response :$3.
  consumer_disputed :$10.
  complaint_id :$10.
;
format date_received date_sent_to_company yymmdd10.;
if l > 30000 then put l= _N_=;
run;

which successfully read the data.

If you have any issues with this (adapt the INFILE path), post the log (include some of the messages for lines where the data step complains).

latte747
Fluorite | Level 6

Thank you @Kurt_Bremser 

When I ran your code, I encountered with this error message.

 

NOTE: The infile "~\complaints.csv" is:

Filename=~\complaints.csv,
RECFM=V,LRECL=1000000,
File Size (bytes)=1955278742,
Last Modified=19Sep2022:05:39:44,
Create Time=19Sep2022:05:39:44

NOTE: 0 records were read from the infile
"~\complaints.csv".
NOTE: The data set WORK.COMPLAINTS has 0 observations and 18 variables.
NOTE: DATA statement used (Total process time):
real time 1:01.15
cpu time 4.37 seconds

 

Tom
Super User Tom
Super User

So you have file that is 1,955,278,742 bytes long and it only contains one LINE.

Perhaps it is NOT a text file at all.

 

Please LOOK at the file.

The LIST statement is your friend here.

This will show you the first 2,000 bytes of that file.  If any of the bytes are not printable characters, such as the CR or LF characters you would normally see at the end of lines, then it will also show the actual HEX codes for every byte in the block of 100 bytes.

data _null_;
  infile 'complaints.csv' lrecl=100 recfm=f obs=20 ;
  input;
  list;
run;

 

Note that a 2 billion byte file with 5 millions lines would have an average line length of only 400 bytes.  So perhaps it is a lot like the 26 line file you posted before.  You just need to try and figure out why you cannot see the line breaks.  Is the file using some other strange characters between the lines instead?

 

Note that the most common source of the error you are showing is trying to read a text file that is using just CR as the end of line character.

 

Example that creates a CSV using CR as end of line characters and then tries to read it with PROC IMPORT without telling it that CR is the end of line characters.

1424  filename csv temp ;
1425  data _null_;
1426    file csv termstr=cr dsd ;
1427    set sashelp.class;
1428    if _n_=1 then put 'Name,Sex,Age,Height,Weight';
1429    put Name Sex Age Height Weight ;
1430  run;

NOTE: The file CSV is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: 20 records were written to the file (system-specific pathname).
      The minimum record length was 17.
      The maximum record length was 26.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


1431
1432  proc import datafile=csv dbms=csv out=want replace;
1433  run;

Unable to sample external file, no data in first 5 records.
ERROR: Import unsuccessful.  See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds

 

Kurt_Bremser
Super User

Then you used a diifferent file than the one you attached. I tested my code with that file, and it WORKS.

Open your file with Notepad++. It will tell you the file format (UNIX or Windows), which determines the TERMSTR, and will reveal any special characters.

It also enables you to shorten the file (without losing information) so you can attach it here.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2432 views
  • 2 likes
  • 3 in conversation