I have very large CSV files with 100+ variables that look something like this:
There are empty columns, and fields with long strings such as addresses or business names often have line breaks interspersed through the file. Aside from deleting these problem columns, I can't figure out a good way to import these files completely. I've tried adding TERMSTR=CRLF, but it doesn't help and causes even worse formatting.
Is there a better way to import and/or to clean up these line breaks/spaces?
id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress
"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""
"9","6/22/2014","First Name 2","Last Name 2","Below Status1
Level","","AB CD
Residency Clinic",""
"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St
City, State",""
For example:
data WORK.WEIRDFORMAT;
    %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
    infile 'weirdformat.csv' delimiter=',' MISSOVER DSD lrecl=32767 firstobs=2;
    informat id $24. ;
    informat EntryDate $11. ;
    informat FirstName $14. ;
    informat LastName $13. ;
    informat Status1 $21. ;
    informat Status2 $2. ;
    informat Location $2. ;
    informat OtherAddress $2. ;
    format id $24. ;
    format EntryDate $11. ;
    format FirstName $14. ;
    format LastName $13. ;
    format Status1 $21. ;
    format Status2 $2. ;
    format Location $2. ;
    format OtherAddress $2. ;
    input
        id $
        EntryDate $
        FirstName $
        LastName $
        Status1 $
        Status2 $
        Location $
        OtherAddress $
    ;
    if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Try the method in the answer to this previous post, but update it to use a comma instead of a pipe.
But watch out if any of the values actually contain commas, as that would confuse the counting of delimiters.
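For illustration outside of SAS, here is a sketch of that delimiter-counting idea in Python, using lines from your sample (hypothetical helper name; assumes the header row shows the true delimiter count and, per the caveat above, that no values contain commas):

```python
# Join physical lines until each logical record has the expected number
# of delimiters. Caveat from above: a comma inside a value breaks this.
def rejoin_records(lines, n_delims):
    records, buf = [], ""
    for line in lines:
        buf = buf + " " + line if buf else line
        if buf.count(",") >= n_delims:   # record is complete
            records.append(buf)
            buf = ""
    if buf:                              # trailing partial record, if any
        records.append(buf)
    return records

raw = [
    'id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress',
    '"9","6/22/2014","First Name 2","Last Name 2","Below Status1',
    'Level","","AB CD',
    'Residency Clinic",""',
]
header_delims = raw[0].count(",")        # 7 commas for 8 fields
fixed = rejoin_records(raw[1:], header_delims)
print(len(fixed))                        # prints 1: one logical record
```

The three broken physical lines are spliced back into one logical record once seven commas have been seen.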
The other way to deal with line breaks inside quoted strings is to keep track of whether you have seen an even or odd number of quote characters. When the count is odd, remove (or replace with a space or something else) any line breaks. I am sure there is a question with that answer on this forum, but I couldn't find it just now.
Here is an example of counting quotes. Let's make your example file with one line that has a couple of embedded line breaks.
filename test temp;
data _null_;
file test;
put
'id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress'
/'"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""'
/'"9","6/22/2014","First Name 2","Last Name 2","Below Status1'
/'Level","","AB CD'
/'Residency Clinic",""'
/'"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St City, State",""'
;
run;
Now let's run a step to copy it to a new file and replace those embedded line breaks with pipe characters (so we can see them).
filename fixed temp;
data _null_;
infile test ;
file fixed ;
input;
put _infile_ @;
q+countc(_infile_,'"');
if mod(q,2) then put '|' @;
else put;
run;
Result:
61   data _null_;
62     infile fixed;
63     input;
64     list;
65   run;

NOTE: The infile FIXED is:
      Filename=.../#LN00087,
      Owner Name=...,Group Name=...,
      Access Permission=-rw-rw-r--,
      Last Modified=18Oct2018:16:26:40,
      File Size (bytes)=363

RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9
1          id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress 69
2          "8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic","" 87
3          "9","6/22/2014","First Name 2","Last Name 2","Below Status1|Level","","AB CD|Residency Cli 91
           nic","" 97
4          "33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St 91
           City, State","" 106

NOTE: 4 records were read from the infile FIXED.
      The minimum record length was 69.
      The maximum record length was 106.
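For illustration outside of SAS, the same quote-parity logic can be sketched in Python (hypothetical function name; an odd running count of quote characters at end-of-line means the break fell inside a quoted value):

```python
def fix_embedded_breaks(lines, repl="|"):
    """Quote-parity fix: keep a running count of double-quote characters;
    an odd count at end-of-line means the line break fell inside a quoted
    value, so splice the next line on instead of ending the record."""
    out, q, buf = [], 0, ""
    for line in lines:
        q += line.count('"')
        buf += line
        if q % 2:            # odd: still inside an open quoted value
            buf += repl      # mark the break (use " " to simply remove it)
        else:                # even: the record is complete
            out.append(buf)
            buf = ""
    if buf:                  # unbalanced quotes at end of file
        out.append(buf)
    return out

sample = [
    'id,EntryDate,FirstName,LastName,Status1,Status2,Location,OtherAddress',
    '"8","5/31/2013","First Name 1","Last Name 1","Below Status1 Level","","ABC D clinic",""',
    '"9","6/22/2014","First Name 2","Last Name 2","Below Status1',
    'Level","","AB CD',
    'Residency Clinic",""',
    '"33","9/3/2015","First Name 3","Last Name 3","Below Status1 Level","","","","100 Random St City, State",""',
]
print(len(fix_embedded_breaks(sample)))  # prints 4: six lines, four records
```

The six physical lines collapse to four logical records, matching the SAS log above.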
@Satish_Parida wrote:
The only drawback to this is that if the LRECL is more than 32767, _infile_ won't be able to hold the record.
If the file can have longer lines, you need to read/write it character by character and add code to detect and remove the end-of-line character(s).
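A character-by-character version can be sketched in Python (hypothetical function and file names); because it never buffers a whole line, record length is not a limit:

```python
def strip_breaks_in_quotes(src_path, dst_path, repl=b" "):
    """Copy src to dst one byte at a time, replacing any CR/LF that occurs
    while inside a quoted field (odd number of quotes seen so far), so no
    full-line buffer is ever needed. Note: a CRLF pair inside quotes
    becomes two replacement bytes with this simple version."""
    in_quotes = False
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            ch = src.read(1)
            if not ch:                        # end of file
                break
            if ch == b'"':
                in_quotes = not in_quotes     # toggle quote state
            if ch in (b"\r", b"\n") and in_quotes:
                dst.write(repl)               # embedded break: replace it
            else:
                dst.write(ch)                 # everything else copies through
```

Reading a single byte at a time is slow in practice; a production version would read in chunks, but the state-tracking idea is the same.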
A quick try would be to add the option TERMSTR=CRLF.
If the stray line breaks are a single character, either a carriage return or a line feed alone, this option might keep a record together until a full carriage-return/line-feed pair is found.
Might.
The quotes should not otherwise be an issue, assuming the informats are correct for the values.
If you used PROC IMPORT to generate the code, make sure you used a large value for GUESSINGROWS to have a better chance of setting the informats correctly, or at least making them usable.
@appleorange wrote:
Yes I tried that and it actually caused more issues. For a dataset that is supposed to have 1800 rows...it returned 52 rows.
Would I be correct in assuming the data started out in Excel, where people manually entered data? This is the sort of thing you get to deal with from the use of ALT-Enter to break text in cells. Sometimes TERMSTR can help; other times not.
@appleorange wrote:
A vendor collects the data, which is manually entered by people through iPads or laptops. Maybe their software/data collection system lets people type in strings the way Excel does. It's very annoying!
So it is an open text field that they might consider cleaning up when they process it into their database.
TERMSTR wouldn't work consistently because you are possibly getting a mix of CR only (iPad or Apple laptops) and CRLF (Windows laptops) when someone types Enter or a similar key (and likely some folks entered two or more to make their display "look nice" while typing). I have no idea about Android devices, but you may be getting some LF-only breaks as well.
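One quick way to check which terminators a file actually contains is to count them; a sketch in Python (hypothetical function name):

```python
def count_terminators(path):
    """Count CRLF pairs, bare CRs, and bare LFs to diagnose mixed line
    endings before deciding on a TERMSTR value."""
    with open(path, "rb") as f:
        data = f.read()
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf    # CRs that are not part of a CRLF pair
    lf = data.count(b"\n") - crlf    # LFs that are not part of a CRLF pair
    return {"CRLF": crlf, "CR": cr, "LF": lf}
```

If more than one of the counts is nonzero, the file has mixed endings and no single TERMSTR setting will read it cleanly.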
Some time in the last year I entered an enhancement request asking SAS to provide more options for handling delimited files, including support for embedded line breaks in quoted field values, as well as other variations such as the use of escape characters instead of quoting for embedded delimiters.
Well, I can definitely see the need. It would be much better than having to pre-process so many CSVs.