This topic is solved and locked.
appleorange
Obsidian | Level 7

I have very large CSV files with 100+ variables that look something like this:

 

There are empty columns, and often fields with long strings, such as addresses and business names, have line breaks interspersed throughout the files.  Aside from deleting these problem columns, I can't figure out a good way to import these files completely.  I've tried adding TERMSTR=CRLF, but it doesn't help and causes even worse formatting.

 

Is there a better way to import and/or to clean up these line breaks/spaces?

 

id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress
"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""
"9","6/22/2014","First Name 2","Last Name 2","Below Status1
Level","","AB CD
Residency Clinic",""
"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St
City, State",""

 

For example:

 

data WORK.WEIRDFORMAT;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'weirdformat.csv' delimiter=',' MISSOVER DSD lrecl=32767 firstobs=2;
informat id $24. ;
informat EntryDate $11. ;
informat FirstName $14. ;
informat LastName $13. ;
informat Status1 $21. ;
informat Status2 $2. ;
informat Location $2. ;
informat OtherAddress $2. ;
format id $24. ;
format EntryDate $11. ;
format FirstName $14. ;
format LastName $13. ;
format Status1 $21. ;
format Status2 $2. ;
format Location $2. ;
format OtherAddress $2. ;
input
id $
EntryDate $
FirstName $
LastName $
Status1 $
Status2 $
Location $
OtherAddress $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

Here is an example of counting quotes.  Let's make your example file with one line that has a couple of embedded line breaks.

filename test temp;
data _null_;
  file test;
  put
  'id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress'
 /'"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""'
 /'"9","6/22/2014","First Name 2","Last Name 2","Below Status1'
 /'Level","","AB CD'
 /'Residency Clinic",""'
 /'"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St City, State",""'
  ;
run;

Now let's run a step to copy it to a new file and replace those embedded line breaks with pipe characters (so we can see them).

filename fixed temp;
data _null_;
  infile test ;
  file fixed ;
  input;
  put _infile_ @;
  q+countc(_infile_,'"');
  if mod(q,2) then put '|' @;
  else put;
run;

Result:

61   data _null_;
62     infile fixed;
63     input;
64     list;
65   run;

NOTE: The infile FIXED is:
      Filename=.../#LN00087,
      Owner Name=...,Group Name=...,
      Access Permission=-rw-rw-r--,
      Last Modified=18Oct2018:16:26:40,
      File Size (bytes)=363

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9
1         id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress 69
2         "8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic","" 87
3         "9","6/22/2014","First Name 2","Last Name 2","Below Status1|Level","","AB CD|Residency Cli
      91  nic","" 97
4         "33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St
      91   City, State","" 106
NOTE: 4 records were read from the infile FIXED.
      The minimum record length was 69.
      The maximum record length was 106.


18 Replies
Reeza
Super User
There's an answer on here that uses INFILE to replace all line feeds ('0A'x), I believe with a space. Figure out which character is causing the issue (you'll need its hex code), then you can replace it in the line and read the file properly.
Tom
Super User

Try the method in the answer to this previous post, but update it to use a comma instead of a pipe.

 

https://communities.sas.com/t5/General-SAS-Programming/Carriage-Return-Multiple-Line-Break-Issue-Imp...

 

But watch out if any of the values actually contain commas as that would confuse the counting of delimiters.

 

The other way to deal with line breaks inside quoted strings is to keep track of whether you have seen an even or odd number of quote characters.  When the count is odd, remove (or replace with a space or something else) any line breaks.  I am sure there is a question with that answer on this forum, but I couldn't find it just now.
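For anyone who would rather repair the file before SAS reads it, the same even/odd-quote idea translates directly to a small pre-processing script. Here is a minimal sketch in Python; the `repair_csv_lines` name and the sample data are illustrative, not from the thread:

```python
# Join physical lines into logical CSV records by tracking quote parity:
# a record is complete only when it contains an even number of '"' characters.
def repair_csv_lines(lines, replacement=" "):
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\r\n")
        # Odd quote count means we are still inside a quoted field, so the
        # line break was embedded: replace it and keep accumulating.
        if buffer.count('"') % 2 == 1:
            buffer += replacement
        else:
            yield buffer
            buffer = ""
    if buffer:  # unterminated final record: emit it as-is
        yield buffer

broken = [
    'id,Name,Address\n',
    '"1","AB CD\n',
    'Clinic","100 Random St\n',
    'City, State"\n',
]
print(list(repair_csv_lines(broken)))
# → ['id,Name,Address', '"1","AB CD Clinic","100 Random St City, State"']
```

Note Tom's earlier caveat still applies: escaped quotes written as `""` toggle the parity twice, so they do not break the count, but stray unbalanced quotes in the data will.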

appleorange
Obsidian | Level 7
I think this might be the thread...

https://communities.sas.com/t5/General-SAS-Programming/Removing-embedded-carriage-returns/td-p/37873...

I'm not sure I understand the logic of this method but I'll look at this thread more.
appleorange
Obsidian | Level 7
Thanks for taking the time to share an example. I get that the | now shows where the line breaks were. I just realized that counting quotes works because embedded line breaks leave a line with an odd number of quotes... sorry, slow today.

I applied this to my dataset and ran a downstream step. It seems to have solved my issue but I'll do some more digging.
Satish_Parida
Lapis Lazuli | Level 10
The only drawback to this is that if the LRECL is more than 32767, _infile_ won't be able to handle the record.
Tom
Super User

@Satish_Parida wrote:
The only drawback to this is that if the LRECL is more than 32767, _infile_ won't be able to handle the record.

If the file can have longer lines, you need to read/write it character by character and add code to detect and remove the end-of-line character(s).
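As a sketch of that character-by-character approach outside SAS, here is a minimal Python version that streams the file in chunks and never buffers a whole record, so record length is effectively unlimited. The function name and sample input are hypothetical:

```python
import io

# Stream a CSV character by character so no single record has to fit in a
# line buffer. Inside a quoted field (odd number of quotes seen so far),
# CR/LF characters are replaced with a space; outside, they pass through.
def strip_embedded_newlines(src, dst, chunk_size=8192):
    in_quotes = False
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out = []
        for ch in chunk:
            if ch == '"':
                in_quotes = not in_quotes
                out.append(ch)
            elif ch in "\r\n" and in_quotes:
                out.append(" ")
            else:
                out.append(ch)
        dst.write("".join(out))

src = io.StringIO('"a","line\nbreak"\n"b","ok"\n')
dst = io.StringIO()
strip_embedded_newlines(src, dst)
print(dst.getvalue())
# → "a","line break"
#   "b","ok"
```

One simplification to be aware of: an embedded CRLF pair inside quotes becomes two spaces here; collapsing it to one would take an extra state flag.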

ballardw
Super User

A quick try would be to add the option TERMSTR=CRLF

 

If the line breaks are caused by a single character (either a carriage return or a line feed), this option might be able to keep a record together until both characters appear together.

 

Might.

 

The quotes should not be an issue otherwise assuming the informat is correct for the values.

 

If you used PROC IMPORT to generate the code, make sure that you used a large value for GUESSINGROWS to have a better chance of setting the informats correctly, or at least usably.

appleorange
Obsidian | Level 7
Yes, I tried that and it actually caused more issues. For a dataset that is supposed to have 1800 rows, it returned 52 rows.

ballardw
Super User

@appleorange wrote:
Yes, I tried that and it actually caused more issues. For a dataset that is supposed to have 1800 rows, it returned 52 rows.


Would I be correct in assuming the data started out in Excel, where people manually entered data? This is the sort of thing you get to deal with from the use of Alt-Enter to break text in cells. Sometimes TERMSTR can help; other times not.

appleorange
Obsidian | Level 7
A vendor collects the data, which is manually entered by people through iPads or laptops. Maybe their software/data collection system lets people type in strings like Excel does. It's very annoying!
ballardw
Super User

@appleorange wrote:
A vendor collects the data, which is manually entered by people through iPads or laptops. Maybe their software/data collection system lets people type in strings like Excel does. It's very annoying!

So it is an open text field that they might consider cleaning up when they process it into their database.

 

TERMSTR wouldn't work consistently because you are possibly getting a mix of CR only (iPad or Apple laptops) and CRLF (Windows laptops) for typing an Enter or similar key (and likely some folks were entering two or more to make their display "look nice" when entering text). I have no idea about Android devices, but you may also be getting some LF only as well.
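Before picking a TERMSTR= value, it can help to count which terminators a file actually contains. This is a quick sketch in Python over the raw bytes; the function name and sample are illustrative:

```python
# Count the three possible line terminators in raw bytes, so you can see
# whether a file mixes CRLF (Windows), bare LF (Unix/Android), and bare CR
# (some Apple input) before choosing a TERMSTR= value.
def terminator_counts(data: bytes):
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # bare CRs (not part of a CRLF pair)
    lf = data.count(b"\n") - crlf   # bare LFs (not part of a CRLF pair)
    return {"CRLF": crlf, "CR": cr, "LF": lf}

sample = b'"a","b"\r\n"c","d\re","f"\n'
print(terminator_counts(sample))
# → {'CRLF': 1, 'CR': 1, 'LF': 1}
```

If all three counts are nonzero, no single TERMSTR= setting will read the file cleanly, which matches the behavior described above.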

Tom
Super User

Some time in the last year I entered an enhancement request to have SAS provide more options for handling delimited files, including support for embedded line breaks in quoted field values, but also other variations, such as the use of escape characters instead of quoting for embedded delimiters.

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

 

appleorange
Obsidian | Level 7

Well, I can definitely see the need... it would be much better than having to pre-process many CSVs.

