This topic is solved and locked.
appleorange
Obsidian | Level 7

I have very large CSV files with 100+ variables that look something like this:

 

There are empty columns, and often fields with long strings, such as addresses and business names, have line breaks interspersed throughout the files.  Aside from deleting these problem columns, I can't figure out a good way to import these files completely.  I've tried adding TERMSTR=CRLF, but it doesn't help and causes even worse formatting.

 

Is there a better way to import and/or to clean up these line breaks/spaces?

 

id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress
"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""
"9","6/22/2014","First Name 2","Last Name 2","Below Status1
Level","","AB CD
Residency Clinic",""
"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St
City, State",""

 

For example:

 

data WORK.WEIRDFORMAT;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'weirdformat.csv' delimiter=',' MISSOVER DSD lrecl=32767 firstobs=2;
informat id $24. ;
informat EntryDate $11. ;
informat FirstName $14. ;
informat LastName $13. ;
informat Status1 $21. ;
informat Status2 $2. ;
informat Location $2. ;
informat OtherAddress $2. ;
format id $24. ;
format EntryDate $11. ;
format FirstName $14. ;
format LastName $13. ;
format Status1 $21. ;
format Status2 $2. ;
format Location $2. ;
format OtherAddress $2. ;
input
id $
EntryDate $
FirstName $
LastName $
Status1 $
Status2 $
Location $
OtherAddress $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

Here is an example of counting quotes.  Let's make your example file with one line that has a couple of embedded line breaks.

filename test temp;
data _null_;
  file test;
  put
  'id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress'
 /'"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""'
 /'"9","6/22/2014","First Name 2","Last Name 2","Below Status1'
 /'Level","","AB CD'
 /'Residency Clinic",""'
 /'"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St City, State",""'
  ;
run;

Now let's run a step to copy it to a new file and replace those embedded line breaks with pipe characters (so we can see them).

filename fixed temp;
data _null_;
  infile test ;
  file fixed ;
  input;
  put _infile_ @;
  q+countc(_infile_,'"');
  if mod(q,2) then put '|' @;
  else put;
run;

Result:

61   data _null_;
62     infile fixed;
63     input;
64     list;
65   run;

NOTE: The infile FIXED is:
      Filename=.../#LN00087,
      Owner Name=...,Group Name=...,
      Access Permission=-rw-rw-r--,
      Last Modified=18Oct2018:16:26:40,
      File Size (bytes)=363

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9
1         id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress 69
2         "8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic","" 87
3         "9","6/22/2014","First Name 2","Last Name 2","Below Status1|Level","","AB CD|Residency Cli
      91  nic","" 97
4         "33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St
      91   City, State","" 106
NOTE: 4 records were read from the infile FIXED.
      The minimum record length was 69.
      The maximum record length was 106.


18 Replies
Reeza
Super User
There's an answer on here that uses INFILE to replace all line feeds ('0A'x), I believe with a space. Figure out which character is causing the issue (you'll need its hex code), then you can replace it in the line and read the file properly.
Tom
Super User

Try the method in the answer to this previous post, but update it to use a comma instead of a pipe.

 

https://communities.sas.com/t5/General-SAS-Programming/Carriage-Return-Multiple-Line-Break-Issue-Imp...

 

But watch out if any of the values actually contain commas as that would confuse the counting of delimiters.

 

The other way to deal with line breaks inside quoted strings is to keep track of whether you have seen an even or odd number of quote characters.  When the count is odd, remove (or replace with a space or something else) any line breaks.  I am sure there is a question with that answer on this forum, but I couldn't find it just now.
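For anyone who would rather repair the file before SAS reads it, the same even/odd-quote idea translates directly to a small pre-processing script. Here is a minimal sketch in Python; the `repair_csv_lines` name and the sample data are illustrative, not from the thread:

```python
# Join physical lines into logical CSV records by tracking quote parity:
# a record is complete only when it contains an even number of '"' characters.
def repair_csv_lines(lines, replacement=" "):
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\r\n")
        # Odd quote count means we are still inside a quoted field, so the
        # line break was embedded: replace it and keep accumulating.
        if buffer.count('"') % 2 == 1:
            buffer += replacement
        else:
            yield buffer
            buffer = ""
    if buffer:  # unterminated final record: emit it as-is
        yield buffer

broken = [
    'id,Name,Address\n',
    '"1","AB CD\n',
    'Clinic","100 Random St\n',
    'City, State"\n',
]
print(list(repair_csv_lines(broken)))
# → ['id,Name,Address', '"1","AB CD Clinic","100 Random St City, State"']
```

Note Tom's earlier caveat still applies: escaped quotes written as `""` toggle the parity twice, so they do not break the count, but stray unbalanced quotes in the data will.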

appleorange
Obsidian | Level 7
I think this might be the thread...

https://communities.sas.com/t5/General-SAS-Programming/Removing-embedded-carriage-returns/td-p/37873...

I'm not sure I understand the logic of this method but I'll look at this thread more.
appleorange
Obsidian | Level 7
Thanks for taking the time to share an example. I get that the | now shows where the line breaks were. I just realized that counting quotes works because embedded line breaks leave a line with an odd number of quotes... sorry, slow today.

I applied this to my dataset and ran a downstream step. It seems to have solved my issue but I'll do some more digging.
Satish_Parida
Lapis Lazuli | Level 10
The only drawback to this is that if the LRECL is more than 32767, _infile_ won't be able to handle the record.
Tom
Super User

@Satish_Parida wrote:
The only drawback to this is that if the LRECL is more than 32767, _infile_ won't be able to handle the record.

If the file can have longer lines, you need to read/write it character by character and add code to detect and remove the end-of-line character(s).
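As a sketch of that character-by-character approach outside SAS, here is a minimal Python version that streams the file in chunks and never buffers a whole record, so record length is effectively unlimited. The function name and sample input are hypothetical:

```python
import io

# Stream a CSV character by character so no single record has to fit in a
# line buffer. Inside a quoted field (odd number of quotes seen so far),
# CR/LF characters are replaced with a space; outside, they pass through.
def strip_embedded_newlines(src, dst, chunk_size=8192):
    in_quotes = False
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out = []
        for ch in chunk:
            if ch == '"':
                in_quotes = not in_quotes
                out.append(ch)
            elif ch in "\r\n" and in_quotes:
                out.append(" ")
            else:
                out.append(ch)
        dst.write("".join(out))

src = io.StringIO('"a","line\nbreak"\n"b","ok"\n')
dst = io.StringIO()
strip_embedded_newlines(src, dst)
print(dst.getvalue())
# → "a","line break"
#   "b","ok"
```

One simplification to be aware of: an embedded CRLF pair inside quotes becomes two spaces here; collapsing it to one would take an extra state flag.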

ballardw
Super User

A quick try would be to add the option TERMSTR=CRLF

 

If the line breaks are caused by a single character (either a carriage return or a line feed), this option might be able to keep a record together until both characters appear together.

 

Might.

 

The quotes should not be an issue otherwise assuming the informat is correct for the values.

 

If you used PROC IMPORT to generate the code, make sure that you used a large value for GUESSINGROWS to have a better chance of setting the informats correctly, or at least usably.

appleorange
Obsidian | Level 7
Yes, I tried that and it actually caused more issues. For a dataset that is supposed to have 1800 rows, it returned 52 rows.

ballardw
Super User

@appleorange wrote:
Yes, I tried that and it actually caused more issues. For a dataset that is supposed to have 1800 rows, it returned 52 rows.


Would I be correct in assuming the data started out in Excel, where people manually entered data? This is the sort of thing you get to deal with from the use of Alt-Enter to break text in cells. Sometimes TERMSTR can help; other times not.

appleorange
Obsidian | Level 7
A vendor collects the data, which is manually entered by people through iPads or laptops. Maybe their software/data collection system lets people type in strings like Excel does. It's very annoying!
ballardw
Super User

@appleorange wrote:
A vendor collects the data, which is manually entered by people through iPads or laptops. Maybe their software/data collection system lets people type in strings like Excel does. It's very annoying!

So it is an open text field that they might consider cleaning up when they process it into their database.

 

TERMSTR wouldn't work consistently because you are possibly getting a mix of CR only (iPad or Apple laptops) and CRLF (Windows laptops) for typing an Enter or similar key (and likely some folks were entering two or more to make their display "look nice" when entering text). I have no idea about Android devices, but you may also be getting some LF only as well.
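Before picking a TERMSTR= value, it can help to count which terminators a file actually contains. This is a quick sketch in Python over the raw bytes; the function name and sample are illustrative:

```python
# Count the three possible line terminators in raw bytes, so you can see
# whether a file mixes CRLF (Windows), bare LF (Unix/Android), and bare CR
# (some Apple input) before choosing a TERMSTR= value.
def terminator_counts(data: bytes):
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # bare CRs (not part of a CRLF pair)
    lf = data.count(b"\n") - crlf   # bare LFs (not part of a CRLF pair)
    return {"CRLF": crlf, "CR": cr, "LF": lf}

sample = b'"a","b"\r\n"c","d\re","f"\n'
print(terminator_counts(sample))
# → {'CRLF': 1, 'CR': 1, 'LF': 1}
```

If all three counts are nonzero, no single TERMSTR= setting will read the file cleanly, which matches the behavior described above.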

Tom
Super User

Some time in the last year I entered an enhancement request to have SAS provide more options for handling delimited files, including support for embedded line breaks in quoted field values, but also other variations, such as the use of escape characters instead of quoting for embedded delimiters.

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

 

appleorange
Obsidian | Level 7

Well, I can definitely see the need... it would be much better than having to pre-process many CSVs.

