Re: Issues with importing a large CSV file

claremc · Posted 07-08-2021 10:58 AM

Hello,

I am trying to import a large CSV file that seems to contain some lines with delimiters that are causing line breaks in false places. For example, I originally tried to import the file with proc import:

proc import datafile = 'M:\Documents\filename.csv' 
out=covid1
dbms=CSV
replace;
getnames=yes;
guessingrows=max; 
run;

The data loaded in without error messages, but most of the lines are shifted so that the wrong values appear for certain variables.

I then tried to use an infile statement with the "termstr=cr" option:

DATA RAW;
INFILE 'M:\Documents\filename.csv' firstobs=2 dsd termstr=cr missover;
INPUT var1 $ var2 $;
Run;

When I did this, only the variable names appeared in my dataset, not the values.

Let me know if there are any tricks for this!

Thanks,

Clare

Tom · Posted 07-08-2021 11:08 AM

The size of the file should not matter (other than making PROC IMPORT take longer the guess how to define the variables).

If you are lucky the embedded line-breaks are using either just CR or just LF and the actual end of lines are marked with CR+LF (the DOS/Windows standard). In that case using TERMSTR=CRLF should allow you to read the file.

Otherwise you need to pre-process the file to remove/replace the embedded line breaks. Then you can read (or even use PROC IMPORT to guess how to read) the file fixed file.

There are literal dozens (if not hundreds) of questions on this forum with solutions that should help you see how to fix the file if the TERMSTR=CRLF does not work.

To see what is in the file use the LIST statement in a data step. If there are CR or LF characters in a line then SAS will print the hex codes for that line and you can see exactly which characters are there. You can use the OBS= and FIRSTOBS= options on the INFILE statement to select which lines to dump. Note this will also help you make sure the delimiter is actually a COMMA as the name CSV implies, or whether this file is using some other character as the delimiter between fields on a line.

data _null_;
  infile 'M:\Documents\filename.csv' obs=5;
  input;
  list;
run;

claremc · Posted 07-08-2021 01:06 PM

Thanks Tom.

I was able to run your code and got this in my log (that's just the beginning)

It seems that many of the columns have the delimiter error. I've read through other posts on this, and many of them select the "problem" lines and deal with them separately. Unfortunately I don't think that is possible for me -- so it sounds like I'll have to do some preprocessing?

ballardw · Posted 07-08-2021 01:38 PM

It might help to tell us exactly what the "delimiter error" is that you think you see.

All @Tom's code is doing is to create text we can with the information of where SAS found a record break. By truncating the lines generated you are removing the information needed.

Copy the LOG text and paste into a text box opened on the forum with the </> icon. That way we can see what the data step saw.

BTW if your data is sensitive then the text box will allow you to type over the characters. Use XXX for character values and 000 (zeroes) for numeric if needed.

claremc · Posted 07-08-2021 02:46 PM

Thanks for the suggestion! Here is the complete log I get after running Tom's code:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 72         
 73         DATA RAW;
 74         INFILE 'FILENAME.csv' firstobs=2 obs=5;
 75         INPUT;
 76         list;
 77         Run;
 
 NOTE: The infile 'filename.csv' is:
       Filename=filename.csv,
       RECFM=V,LRECL=32767,File Size (bytes)=49471666,
       Last Modified=08Jul2021:02:23:25,
       Create Time=08Jul2021:07:15:26
 
 RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0                      
 2         0,123456,1234567,891011121314,2021-00-00 03:30:40,,2021-00-00 15:37:40,94500-6,TJU SARS-COV-2 RNA QUALI
      101  TATIVE NAAT,,Lab,,,POSITIVE,"REVERSE TRANSCRIPTION POLYMERASE CHAIN REACTION (RT-PCR) 185
 3         ",POSITIVE,F,84,0,XXXXX,XXXX,1900-00-00,(XXX)xxx-xxxx,1234 APPLE RD,,PHILADELPHIA,PA,
      101   race,ENG,191,W,1144,Jefferson Ambulatory,,,,RACE,,,, 161
 4         0,1359946,1345887,403041498,2020-11-24 10:00:00,,2020-11-24 10:00:00,94309-2,TJU SARS-COV-2 RNA QUAL
      101  ITATIVE NAAT,,Lab,"","",POSITIVE,SARS-COV-2 RNA DETECTED.  POSITIVE RESULTS DO NOT RULE OUT BACTERIA
      201  L INFECTION OR CO-INFECTION WITH OTHER VIRUSES. | PENNSYLVANIA DEPARTMENT OF HEALTH WILL BE NOTIFIED
      301  . 0X0D | ,POSITIVE,F,50,0,PATIENT NAME,1900-00-0,(xxx)XXX-xxxx,address,,PHILADELPHIA,PA,
      401  race,,19136,,1168, University Hospital,"","","",Unknown,"","","","" 497
 5         0,1359949,1345890,03083743,2021-01-09 08:21:00,,2021-01-09 08:21:19,94309-2,TJU SARS-COV-2 RNA QUALI
      101  TATIVE NAAT,,Lab,,,POSITIVE,"SARS-COV-2 RNA DETECTED.  POSITIVE RESULTS DO NOT RULE OUT BACTERIAL IN
      201  FECTION OR CO-INFECTION WITH OTHER VIRUSES. 243
 NOTE: 4 records were read from the infile 'M:\DDC\EPI\ACD\COVID\datasTest\HSX\covid19_phila_dph_2021-07-08.csv'.
       The minimum record length was 161.
       The maximum record length was 497.
 NOTE: The data set WORK.RAW has 4 observations and 0 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.02 seconds
       cpu time            0.01 seconds
       
 
 78         
 79         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

As you can see, something is going on with the delimiters. Using the classic csv comma is not working. I'm just not sure how to get around this.

ballardw · Posted 07-08-2021 04:20 PM

A brief discussion on how to read that log.

The line with Rule in the left margin is a mono-space character position record indicator. It has numeral for each 10 positions, the + is a 5 so you can find the column indicators from 1 to 100 (where the 0 appears in the Rule).

There are numbers in the extreme left that indicate the line of the file: 2,3,4 and 5.

To the right of those and left of the position of the first "rule" space indicate is a number when the line exceeds 100 characters. The 101 says the first character under the rule is at column 101 in the source file. If the line is long enough you see a 201, 301 etc. These numbers depend on some system settings. In some environment you might see 81 or 121 or similar. The number that appears after the text on the last log is the total length of a specific data line. For example the Line 2 has 185 total characters.

So looking at that text I would guess, just because of similarity of fields that lines 2, 4 and 5 (from the log left column) are the starts of records and line 3 is supposed to be part of line 2. Would I be correct in guessing that 5 terminates too early as well?

Let us know how good my guesses are.

The examples I'm seeing would show a line with an unbalanced quote on a line. I think there's an example of using the _infile_ automatic variable with that information somewhere on this forum. We'll have to look for it.

claremc · Posted 07-08-2021 04:30 PM

Thanks for the clarification, that was helpful. And your guesses are correct. Lines 2, 4, and 5 are separate records..

I tried using the delimiter statement in the infile line "dlm='0A'x' after reading this documentation- http://support.sas.com/techsup/technote/ts673.pdf ,

but to no avail -

filename rawinput "M:\DDC\EPI\ACD\COVID\datasTest\HSX\covid19_phila_dph_2021-07-08.csv" dsd termstr=cr;
DATA hsx;
INFILE rawinput firstobs=2 dlm='0A'x ;
INPUT var1 $ var2$;
list; 
Run;

Tom · Posted 07-08-2021 05:24 PM

So the file is using comma as the delimiter. It looks like line number 2 has an extra end-of-line inserted just after the right parentheses.

Try running it again using CRLF as the TERMSTR= setting on the INFILE statement and see if line number 2 looks different.

DATA _null_;
  INFILE 'FILENAME.csv' firstobs=2 obs=5 termstr=crlf;
  INPUT;
  LIST;
RUN:

If it still breaks at the same place then you will need to fix the file.

Since it looks like the values with embedded end of line characters are quoted you can use the quote counting method.

For example by replace any CR or LF inside of quotes with pipe characters instead.

DATA _null_;
  INFILE 'FILENAME.csv' recfm=f lrecl=1;
  file 'newfilename.csv' recfm=f lrecl=1;
  input ch $char1. ;
  q+(ch='"');
  q=mod(q,2);
  if q then ch=translate(ch,'||','0D0A'x);
  put ch $char1. ;
run;

Tom · Posted 07-09-2021 11:51 AM

@claremc Were you able to use that data step make a version of the CSV file that can be read by PROC IMPORT or SAS data step?

Here is a link to a SASware Ballot idea asking SAS to enhance support for this file format. Please upvote that ballot entry.

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

claremc · Posted 07-12-2021 08:25 AM

Sadly, I was not. Just upvoted the post!

claremc · Posted 07-12-2021 08:24 AM

Thanks for your help Tom. Looks like I'll need to try and get the people who send me this data to change the file 😞

SAS Innovate 2025: Call for Content

Classroom Training Available!