BookmarkSubscribeRSS Feed
claremc
Obsidian | Level 7

Hello, 

 

I am trying to import a large CSV file that seems to contain some lines with delimiters that are causing line breaks in false places. For example, I originally tried to import the file with proc import: 

proc import datafile = 'M:\Documents\filename.csv' 
out=covid1
dbms=CSV
replace;
getnames=yes;
guessingrows=max; 
run;

 The data loaded in without error messages, but most of the lines are shifted so that the wrong values appear for certain variables.

I then tried to use an infile statement with the "termstr=cr" option:

DATA RAW;
INFILE 'M:\Documents\filename.csv' firstobs=2 dsd termstr=cr missover;
INPUT var1 $ var2 $;
Run;

 When I did this, only the variable names appeared in my dataset, not the values. 

 

Let me know if there are any tricks for this! 

Thanks, 

Clare

10 REPLIES 10
Tom
Super User Tom
Super User

The size of the file should not matter (other than making PROC IMPORT take longer the guess how to define the variables).

 

If you are lucky the embedded line-breaks are using either just CR or just LF and the actual end of lines are marked with CR+LF (the DOS/Windows standard).  In that case using TERMSTR=CRLF should allow you to read the file.

 

Otherwise you need to pre-process the file to remove/replace the embedded line breaks.  Then you can read (or even use PROC IMPORT to guess how to read) the file fixed file.

 

There are literal dozens (if not hundreds) of questions on this forum with solutions that should help you see how to fix the file if the TERMSTR=CRLF does not work.

 

To see what is in the file use the LIST statement in a data step.  If there are CR or LF characters in a line then SAS will print the hex codes for that line and you can see exactly which characters are there.  You can use the OBS= and FIRSTOBS= options on the INFILE statement to select which lines to dump.  Note this will also help you make sure the delimiter is actually a COMMA as the name CSV implies, or whether this file is using some other character as the delimiter between fields on a line.

data _null_;
  infile 'M:\Documents\filename.csv' obs=5;
  input;
  list;
run;
claremc
Obsidian | Level 7

Thanks Tom. 

 

I was able to run your code and got this in my log (that's just the beginning)

 

claremc_0-1625763903758.png

 

It seems that many of the columns have the delimiter error. I've read through other posts on this, and many of them select the "problem" lines and deal with them separately. Unfortunately I don't think that is possible for me -- so it sounds like I'll have to do some preprocessing? 

 

 

ballardw
Super User

It might help to tell us exactly what the "delimiter error" is that you think you see.

All @Tom's code is doing is to create text we can with the information of where SAS found a record break. By truncating the lines generated you are removing the information needed.

Copy the LOG text and paste into a text box opened on the forum with the </> icon. That way we can see what the data step saw.

BTW if your data is sensitive then the text box will allow you to type over the characters. Use XXX for character values and 000 (zeroes) for numeric if needed.

 

 

claremc
Obsidian | Level 7

Thanks for the suggestion! Here is the complete log I get after running Tom's code: 

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 72         
 73         DATA RAW;
 74         INFILE 'FILENAME.csv' firstobs=2 obs=5;
 75         INPUT;
 76         list;
 77         Run;
 
 NOTE: The infile 'filename.csv' is:
       Filename=filename.csv,
       RECFM=V,LRECL=32767,File Size (bytes)=49471666,
       Last Modified=08Jul2021:02:23:25,
       Create Time=08Jul2021:07:15:26
 
 RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0                      
 2         0,123456,1234567,891011121314,2021-00-00 03:30:40,,2021-00-00 15:37:40,94500-6,TJU SARS-COV-2 RNA QUALI
      101  TATIVE NAAT,,Lab,,,POSITIVE,"REVERSE TRANSCRIPTION POLYMERASE CHAIN REACTION (RT-PCR) 185
 3         ",POSITIVE,F,84,0,XXXXX,XXXX,1900-00-00,(XXX)xxx-xxxx,1234 APPLE RD,,PHILADELPHIA,PA,
101 race,ENG,191,W,1144,Jefferson Ambulatory,,,,RACE,,,, 161 4 0,1359946,1345887,403041498,2020-11-24 10:00:00,,2020-11-24 10:00:00,94309-2,TJU SARS-COV-2 RNA QUAL 101 ITATIVE NAAT,,Lab,"","",POSITIVE,SARS-COV-2 RNA DETECTED. POSITIVE RESULTS DO NOT RULE OUT BACTERIA 201 L INFECTION OR CO-INFECTION WITH OTHER VIRUSES. | PENNSYLVANIA DEPARTMENT OF HEALTH WILL BE NOTIFIED 301 . 0X0D | ,POSITIVE,F,50,0,PATIENT NAME,1900-00-0,(xxx)XXX-xxxx,address,,PHILADELPHIA,PA, 401 race,,19136,,1168, University Hospital,"","","",Unknown,"","","","" 497 5 0,1359949,1345890,03083743,2021-01-09 08:21:00,,2021-01-09 08:21:19,94309-2,TJU SARS-COV-2 RNA QUALI 101 TATIVE NAAT,,Lab,,,POSITIVE,"SARS-COV-2 RNA DETECTED. POSITIVE RESULTS DO NOT RULE OUT BACTERIAL IN 201 FECTION OR CO-INFECTION WITH OTHER VIRUSES. 243 NOTE: 4 records were read from the infile 'M:\DDC\EPI\ACD\COVID\datasTest\HSX\covid19_phila_dph_2021-07-08.csv'. The minimum record length was 161. The maximum record length was 497. NOTE: The data set WORK.RAW has 4 observations and 0 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds 78 79 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

As you can see, something is going on with the delimiters. Using the classic csv comma is not working. I'm just not sure how to get around this. 

ballardw
Super User

A brief discussion on how to read that log.

The line with Rule in the left margin is a mono-space character position record indicator. It has numeral for each 10 positions, the + is a 5 so you can find the column indicators from 1 to 100 (where the 0 appears in the Rule).

There are numbers in the extreme left that indicate the line of the file: 2,3,4 and 5.

To the right of those and left of the position of the first "rule" space indicate is a number when the line exceeds 100 characters. The 101 says the first character under the rule is at column 101 in the source file. If the line is long enough you see a 201, 301 etc. These numbers depend on some system settings. In some environment you might see 81 or 121 or similar. The number that appears after the text on the last log is the total length of a specific data line. For example the Line 2 has 185 total characters.

So looking at that text I would guess, just because of similarity of fields that lines 2, 4 and 5 (from the log left column) are the starts of records and line 3 is supposed to be part of line 2. Would I be correct in guessing that 5 terminates too early as well?

 

Let us know how good my guesses are.

 

The examples I'm seeing would show a line with an unbalanced quote on a line. I think there's an example of using the _infile_ automatic variable with that information somewhere on this forum. We'll have to look for it.

 

claremc
Obsidian | Level 7

Thanks for the clarification, that was helpful. And your guesses are correct. Lines 2, 4, and 5 are separate records.. 

 

I tried using the delimiter statement in the infile line "dlm='0A'x' after reading this documentation- http://support.sas.com/techsup/technote/ts673.pdf ,

but to no avail - 

filename rawinput "M:\DDC\EPI\ACD\COVID\datasTest\HSX\covid19_phila_dph_2021-07-08.csv" dsd termstr=cr;
DATA hsx;
INFILE rawinput firstobs=2 dlm='0A'x ;
INPUT var1 $ var2$;
list;
Run;
Tom
Super User Tom
Super User

So the file is using comma as the delimiter.  It looks like line number 2 has an extra end-of-line inserted just after the right parentheses.

 

Try running it again using CRLF as the TERMSTR= setting on the INFILE statement and see if line number 2 looks different.

DATA _null_;
  INFILE 'FILENAME.csv' firstobs=2 obs=5 termstr=crlf;
  INPUT;
  LIST;
RUN:

If it still breaks at the same place then you will need to fix the file.

 

Since it looks like the values with embedded end of line characters are quoted you can use the quote counting method.

For example by replace any CR or LF inside of quotes with pipe characters instead.

DATA _null_;
  INFILE 'FILENAME.csv' recfm=f lrecl=1;
  file 'newfilename.csv' recfm=f lrecl=1;
  input ch $char1. ;
  q+(ch='"');
  q=mod(q,2);
  if q then ch=translate(ch,'||','0D0A'x);
  put ch $char1. ;
run;
Tom
Super User Tom
Super User

@claremc Were you able to use that data step make a version of the CSV file that can be read by PROC IMPORT or SAS data step?

 

Here is a link to a SASware Ballot idea asking SAS to enhance support for this file format. Please upvote that ballot entry.

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

 

claremc
Obsidian | Level 7

Sadly, I was not. Just upvoted the post! 

claremc
Obsidian | Level 7
Thanks for your help Tom. Looks like I'll need to try and get the people who send me this data to change the file 😞

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2751 views
  • 1 like
  • 3 in conversation