BookmarkSubscribeRSS Feed

There are some common variations on how to create delimited files used by data software, like Excel and Redshift, that are currently not directly supported by SAS.  Of these there are two main ones that I think should be easy to implement and solve the most common problems.

 

Support for reading and writing files with embedded end-of-line characters. 

This is common for CSV files generated by EXCEL.  Currently SAS can only handle files where the embedded line break characters do not exactly match the end-of-line characters.  So SAS can read a file that uses CR+LF as end of line and has embedded CR or LF characters, but not records that have embedded CR+LF combinations.  But when the field is properly quoted this is a format that Excel (and others) do support. 

 

This one is especially annoying since it is difficult to program around this limitation.  Currently to read files like this users need to pre-process the whole file to remove or replace the embedded line break characters.  And for writing files SAS does not currently automatically add quotes around field values that contain the end-of-line characters.  So users might need to use the ~ modify on PUT statements to force SAS to add quotes and end up with files that have more quotes than are really needed.

 

Support for the use of an escape character

To protect special characters like delimiters, CR, LF, quotes and the escape character itself a lot of databases (and programming languages) use an escape character, typically a backslash.  This method of protecting special characters in text is popular with many languages, for example Unix command shells, and has been adopted by major database platforms like Redshift.  Typically this is done instead of adding quotes around values that contain special characters. Some even try to add quotes around the values but instead of doubling the quotes that are in the data they add this backslash character.

SAS should be able to read a delimited file that is using an escape character instead of quoting to protect embedded special characters. And it should be able to write files that use escaping instead of quoting to protect special characters.

11 Comments
barefootguru
Fluorite | Level 6

According to RFC 4180, returns are valid inside a quoted field, so SAS is the odd man out here.  We have to preprocess CSV files to strip out those characters so SAS can read them correctly — fixing this would be great.

Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.
ChrisNZ
Tourmaline | Level 20

Upvoted. 

 

Any statistics on how many ballot proposals become production?

My guess is epsilon.

 

The main additions to the language (such as the CAT functions?) seem to originate elsewhere.

My guess is these improvements come when SAS Institute developers need them to write something else (packages such as the AML or Model Manager) or when there is a threatened sale.

 

 

BeverlyBrown
Community Manager
Status changed to: Not Planned

Thanks for your idea. Some digging uncovered evidence that it kicked off significant internal deliberation. For anyone running across this thread later, here's the reference material @barefootguru mentioned: Tips for Reading Data from CSV Files (scroll down a bit to see the section on Compliance with RFC 4180) and a SAS Note: http://support.sas.com/kb/26/065.html.

Tom
Super User
Super User

That is too bad it would be much better to have the functionality built in.

 

Can you fix the first link or attach the document? As posted it does not work. So that first link is just a link explaining how the current process works. How is that helpful?

 

Can you also post solutions for the other types of formats mentioned?  Like read or writing files that use escape characters, like backslash, instead of quoting.  Or adding quotes around values with embedded end of line characters when writing a delimited file.

 

Kurt_Bremser
Super User

I think I'll refrain from wasting my time with the SASWare Ballots in the future. If that much support does not make it into development, why bother?

ChrisNZ
Tourmaline | Level 20

@Kurt_Bremser  I sometime have the feeling that community contributions are welcome to keep the web site alive and the users' questions answered, but when as far as the product itself is is concerned it's a different story. Such a waste of feedback from actual -competent, the most competent even, and the most engaged- users. What better sources do SAS Institute have for planning improvements?

ChrisNZ
Tourmaline | Level 20

@BeverlyBrown 
This deviation enables parallel processing of CSV files from Hadoop.

That is a valid reason in the context given in the link. But

1. SAS is using the file here, not Hadoop, so having SAS use a Hadoop-compliant logic is of limited value.

2. There should be an option to turn this on or off. 

3. Hadoop has an active developer community that provides quality solutions to common problems. Such as SerDe and opencsv.sourceforge.net in this case. SAS on the other hand can only use tools coming from SAS Institute. So the onus is on SAS Institute to provide a way to easily read such -very common- files.

 

 

BeverlyBrown
Community Manager

Hi @ChrisNZ, I made sure some colleagues in Tech Support and R&D saw this thread, and one thought this documentation might help. While we appreciate all suggestions, they don’t always relate to features under active development, though they may be actively used and supported. As for your and Kurt’s critiques, we hear you. There’s always room to improve on closing the loop with those of you so generous with your feedback. Thank you for keeping us on our toes.

 

ChrisNZ
Tourmaline | Level 20

Hi Beverly, Thank you for your kind reply.

Does the link provided indicate that the HADOOP Access Method treats files differently when treating embedded CR or LF characters?

Tom
Super User
Super User

I have created a pair of SAS macros to at least help a little with the first problem, embedded end of line characters.

 

The first macro, %safe_ds2csv(), is for generating a CSV file from a SAS dataset that will make sure that values which contain CR or LF characters are enclosed in double quotes in the file.  It will also use CRLF combination as the end of line markers in the file and replace any CRLF pairs in the data with just CR instead.  In addition it will ensure that any values that contain single quote (apostrophe) characters are also quoted in the file to prevent another issue such characters can cause when read by SAS data step.

https://github.com/sasutils/macros/blob/master/safe_ds2csv.sas 

 

The second macro, %replace_crlf(), is to handle existing CSV files that have embedded CR or LF characters and that are quoted.  It will create a new CSV file and allow you to specify characters to replace the CR or LF characters that are inside quotes (or simply remove them).

https://github.com/sasutils/macros/blob/master/replace_crlf.sas 

 

Hopefully this will help until SAS actually makes some enhancements to the language for dealing with such files.