DATA Step, Macro, Functions and more

Read CSV file with embedded CRLF

Accepted Solution Solved
Reply
Super Contributor
Posts: 376
Accepted Solution

Read CSV file with embedded CRLF

Hi,

I have a Powershell script which dynamically queries an Excel file and spits it out as a CSV (or other delimited) file.  Think of it as dynamically saving the Excel file as a CSV file at runtime.  Note:  we do not have SAS/ACCESS to PC File formats licensed.

Occasionally that Excel file will have an embedded CRLF (Alt-Enter within an Excel cell).  This causes the CSV record to have an unwanted CRLF "in the middle", which messes up the data import into SAS.

Is there any (easy) way I can read such a file correctly?  I can easily change my script to output an arbitrary end-of-record (EOR) marker, for example a caret (^) or tilde (~) - some agreed character that cannot be present in any of the data. 

So, for example, my CSV file might then look like:

VAR1, VAR2, VAR3

A,B,C^<CRLF>

D,E,F^<CRLF>

G,H<CRLF>

I<CRLF>

J,K^<CRLF>

In this example, I would want my EOR to be ^<CRLF>, and my SAS dataset to look like:

VAR1 VAR2 VAR3

A    B    C

D    E    F

G    H    K

     I

     J

The only ideas I can think of are:  1) INFILE TERMSTR=..., but it doesn't look like I can specify an arbitrary EOR indicator (unlike say Perl), or 2) RECFM=N, but I think (???) I'd have to write my own parser, since the file is now treated as a byte stream.  This would be "messy".

Thanks for any suggestions re: how to tackle this issue.

Regards,

Scott


Accepted Solutions
Solution
‎11-23-2014 11:21 PM
PROC Star
Posts: 7,364

Re: Read CSV file with embedded CRLF

Scott,

Without seeing the file one can't be sure, but Alt-Enters in Excel are typically represented by the hex character '0A'x;

If you define your file with a filename statement, that includes a termstr=crlf, I think you can parse the field using '0A'x as the delimiter, and then input the separate entries like you would any comma delimited field.

View solution in original post


All Replies
Solution
‎11-23-2014 11:21 PM
PROC Star
Posts: 7,364

Re: Read CSV file with embedded CRLF

Scott,

Without seeing the file one can't be sure, but Alt-Enters in Excel are typically represented by the hex character '0A'x;

If you define your file with a filename statement, that includes a termstr=crlf, I think you can parse the field using '0A'x as the delimiter, and then input the separate entries like you would any comma delimited field.

Super Contributor
Posts: 376

Re: Read CSV file with embedded CRLF

Hi,

I've got some working code, but welcome feedback whether it's the best approach.

What I did:

data have;

  informat Sales Inventory Returns;

  set sashelp.shoes;

  where subsidiary="Addis Ababa";

run;

In DMS, RMB and view in Excel (or export to Excel some other way)

I deleted the first couple rows, saved as Excel xlsx, added Alt-Enter in some Products, saved that as Excel xlsx.

Ran my Powershell script to create two CSV files (see attached).  This output is slightly different than Excel's SaveAs CSV (wrt quoted fields)

The SAS doc needs to be improved with respect to the TERMSTR option.  It implies a limited set of values (CR, CRLF, etc).  I tried an explicit EOR character, which seemed to work.  BTW, I'm using SAS 9.3 on Windows 7.

Here is my "working" code, further feedback welcome:

data test1;

  infile "c:\temp\temp1.csv" dsd dlm="|" termstr="^" truncover firstobs=2;

  input @@;

  putlog _infile_;

  putlog _infile_ hex128. /;  * note the leading CRLF "0D0A"x ;

  * option 1: remove all CRLF - this is the better option ;

_infile_=compress(_infile_,"0D0A"x);  * if you comment this out you'll get an error ;

  * option 2: remove only leading CRLF - this can cause problems with later where processing ;

_infile_=prxchange("s/^\x0D\x0A//",-1,_infile_);

  putlog _infile_ hex128. /;  * leading CRLF now removed ;

  input Sales Inventory Returns Region ~ $10. Product ~ $20. Subsidiary ~ $20. Stores;

  format Sales -- Returns dollar12.;

  if missing(Region) then delete;

run;

data test2;

  infile "c:\temp\temp2.csv" dsd dlm="|" termstr="^" truncover firstobs=2;

  input @@;

  putlog _infile_;

  putlog _infile_ hex128. /;  * note the leading CRLF "0D0A"x ;

  * option 1: remove all CRLF - this is the better option ;

_infile_=compress(_infile_,"0D0A"x);  * if you comment this out you'll get an error ;

  * option 2: remove only leading CRLF - this can cause problems with later where processing ;

_infile_=prxchange("s/^\x0D\x0A//",-1,_infile_);

  putlog _infile_ hex128. /;  * leading CRLF now removed ;

  input Sales Inventory Returns Region ~ $10. Product ~ $20. Subsidiary ~ $20. Stores;

  format Sales -- Returns dollar12.;

  if missing(Region) then delete;

run;

Attachment
Attachment
Valued Guide
Posts: 3,208

Re: Read CSV file with embedded CRLF

The real question is whether the way you are trying to solve the cr/lf issue with a double meaning / intrepretation is not causing more problems as it will solve.

As it is some form of input validations you could do that with excel before handing over to SAS.  3 ways to remove carriage returns in Excel: formulas, VBA macro, find&replace dialog

It is the question on how the disable hard enter in excel before going into how to process that in SAS.

It is the same story why using a comma is a bad approach in an I18N world (decimal points differ). The tab being used (binary limitatioen).
The choice of using special chars for some special meaning will always be have some confusing moments. (brackets in XML)

Using fixed lengt type records or having a field with the length of each the record do not have those side-effects. Just a pitty they are not common standardized formats.

Not that amazing googling (hard enter disable) this popped up (unknown quality) How to quickly remove line breaks in Excel? It is a commercial offer addins for excel. There must be more.

---->-- ja karman --<-----
Super User
Super User
Posts: 7,413

Re: Read CSV file with embedded CRLF

I had the same issue and went through Tech Support for resolution: https://communities.sas.com/thread/60374

Super Contributor
Posts: 376

Re: Read CSV file with embedded CRLF

I don't follow how this is the same issue, although I appreciate the reply.

Super User
Posts: 6,963

Re: Read CSV file with embedded CRLF

Could you make your VBA script command Excel to save the "character" cells with surrounding quotes? If yes, I have written a short C program that removes linefeeds when the number of quotes encountered is odd (indicating that there is an "open string")

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 9,687

Re: Read CSV file with embedded CRLF

How about this :

data test2;
  infile "c:\temp\temp2.csv" dsd dlm="|^" recfm=n;
  input x : $40. @@;
  n=mod(_n_,7); 
  if n=1 then group+1;
run;
proc transpose data=test2 out=want2(drop=_:) prefix=var_;
by group;
id n;
var x;
run;


Xia Keshan

Super User
Super User
Posts: 6,502

Re: Read CSV file with embedded CRLF

First if you generate the CSV file with Excel then it should have the strings with embedded CRLF properly quoted and you can use the quote counting tricks from other threads on this topic to convert the CRLF to single CR or LF or some other special character and then treat the converted file as normal.

Second you should be able to use the ^ in your example as the TERMSTR value on the INPUT statement.  The problem you will have is that after the first line the CRLF pairs at the ends of your lines (after the '^' ) will appear at the beginning of every line after the first.  You could probably just read that into a dummy variable.

filename x temp;

data _null_;

  file x ;

  put

'A,B,C^' '0D0A'x

'D,E,F^' '0D0A'x

'G,H' '0D0A'x

'I' '0D0A'x

'J,K^'  @;

run;

data _null_;

infile x termstr='^' dsd truncover ;

length x1-x4 $10 ;

input x1-x4 ;

list ;

put (x1-x4) (= $hex20.);

run;

Respected Advisor
Posts: 3,777

Re: Read CSV file with embedded CRLF

Alt-Enter in EXCEL inserts LF '0A'x not CRLF.  If you your records are DOS CRLF delimited then specifying INFILE option TERMSTR=CRLF should allow you to read the file without further problems.

Otherwise you don't have what you think you have, etc.

Super Contributor
Posts: 358

Re: Read CSV file with embedded CRLF

Hi

I have has success with the IGNOREDOSEOF option when reading the data.  Just a thought.

Super Contributor
Posts: 376

Re: Read CSV file with embedded CRLF

Hi All,

Thanks for all the replies.  Here is further detail; apologies if it's cryptic...

  • My Powershell script to query Excel is here:  Powershell/Query-Excel.ps1 at master · scottbass/Powershell · GitHub
  • Example invocations are:
    • .\Query-Excel.ps1 C:\Temp\Temp1.xlsx "select * from [Sheet1$]" -csv -dlm "|"
    • .\Query-Excel.ps1 C:\Temp\Temp1.xlsx "select * from [Sheet1$]" -csv -dlm "|" | % {"$_^"}  # to append an EOR marker to each record
    • .\Query-Excel.ps1 C:\Temp\Temp1.xlsx "select * from [Sheet1$]" -csv -dlm "|" | % {"$_^"} | Out-File C:\Temp\Temp1.csv -Encoding Ascii  # to create an output CSV file (could also use redirection > )
  • Example SAS invocation:

%let powershell = C:\Windows\syswow64\WindowsPowerShell\v1.0\powershell.exe;

%let psoptions  = -noprofile -executionpolicy unrestricted -command;

%let script     = E:\Powershell\Scripts\Query-Excel.ps1;

%let delimiter  = |;  %* do not use dlm as a macro variable name! ;

%let outoptions = -encoding ASCII;

%let worksheet  = Sheet1;

%let select     = select * from [&worksheet$];

  options noquotelenmax;

  data _null_;

    infile

"&powershell &psoptions ""&script -path '&excel_file' -query '&select' -delimiter '&delimiter' |

Out-File '&outfile' &outoptions

"" 2>&1"

    pipe lrecl=1000;

    input;

    putlog "WAR" "NING: " _infile_; %* should never occur unless script has an error ;

  run;

  options quotelenmax;


Obviously you can also invoke this from the Powershell console window to test/debug.

  • Save the output as a "normal" CSV
  • Use PROC IMPORT to generate a data step to import the CSV
  • Modify to suit - this is now your data import step
  • Run the SAS code above to dynamically convert the Excel file to CSV (I write the CSV file as a temp file in the work directory.  I could also read the pipe directly, but using a physical file is easier to debug.)
  • Run the PROC IMPORT code to import as a SAS dataset
  • Add embedded Alt-Enter into your Excel file, rinse and repeat.
  • The embedded Alt-Enter will mess up the import of the Excel --> CSV data

So, I'll need to modify my PROC IMPORT generated data step code to cater for the embedded Alt-Enter (0Ax) characters.  My 2nd post is my first attempt at this code.

IMO, a giant step in the right direction is a user-configuration EOR character.  Here is an except from the SAS documentation:

TERMSTR=

specifies the end-of-line character for the file. Use this option to share files between the UNIX and Windows operating environments. If termstr is not specified, a single LF or a CRLF function as the end of line character. If termstr=CRLF, then CRLF functions as the EOL character. The following are under Windows:

CRLF

Carriage return line feed. Use TERMSTR=CRLF to read Windows or DOS formatted files. CRLF is the default.

LF

Line feed. Use TERMSTR=LF to read UNIX formatted files. If a file contains CRLF characters, the CR functions as part of the data and not an end of line character.

CR

Carriage Return. Use TERMSTR=CR if the end of line character is a CR.

The implication is that these are the only settings available.  However, TERMSTR="^" worked for me without causing an error.  IMO the SAS documentation should be expanded and clarified.

I'm not going to use VBA or C approaches, although I appreciate the suggestions.

IGNOREDOSEOF also looks promising.  However, here is the SAS 9.3.doc:

IGNOREDOSEOF

is used in the context of I/O operations on variable record format files. When this option is specified, any occurrence of ^Z is interpreted as character data and not as an end-of-file marker.

And here is a Google hit:

45522 - IGNOREDOSEOF option not honored when specified on the FILENAME statement

Is ^Z the same as 0Ax?  From the SAS doc, I wouldn't have thought to use IGNOREDOSEOF for this issue.

Update: 

IGNOREDOSEOF did not fix the issue:

827  data foo;

828    infile "c:\temp\temp2.csv" dsd dlm="|" truncover firstobs=2 ignoredoseof;

829    input Sales Inventory Returns Region ~ $10. Product ~ $20. Subsidiary ~ $20. Stores;

830    format Sales -- Returns dollar12.;

831    if missing(Region) then delete;

832  run;

NOTE: The infile "c:\temp\temp2.csv" is:

      Filename=c:\temp\temp2.csv,

      RECFM=V,LRECL=256,File Size (bytes)=465,

      Last Modified=25 November 2014 11:50:47,

      Create Time=24 November 2014 15:38:51

NOTE: Invalid data for Sales in line 4 1-6.

NOTE: Invalid data for Inventory in line 4 8-18.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---

4         Casual|Addis Ababa|4 20

Sales=. Inventory=. Returns=$4 Region=  Product=  Subsidiary=  Stores=. _ERROR_=1 _N_=3

NOTE: Invalid data for Sales in line 6 1-5.

NOTE: Invalid data for Inventory in line 6 7-17.

6         Dress|Addis Ababa|7 19

Sales=. Inventory=. Returns=$7 Region=  Product=  Subsidiary=  Stores=. _ERROR_=1 _N_=5

NOTE: Invalid data for Sales in line 11 1-6.

NOTE: Invalid data for Inventory in line 11 8-18.

11        Casual|Addis Ababa|2 20

Sales=. Inventory=. Returns=$2 Region=  Product=  Subsidiary=  Stores=. _ERROR_=1 _N_=10

NOTE: Invalid data for Sales in line 13 1-5.

NOTE: Invalid data for Inventory in line 13 7-17.

13        Dress|Addis Ababa|12 20

Sales=. Inventory=. Returns=$12 Region=  Product=  Subsidiary=  Stores=. _ERROR_=1 _N_=12

NOTE: 12 records were read from the infile "c:\temp\temp2.csv".

      The minimum record length was 19.

      The maximum record length was 47.

NOTE: The data set WORK.FOO has 8 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.03 seconds

      cpu time            0.03 seconds

The below code is still my best approach so far, although it admittedly has the overhead of appending the EOR indicator to the CSV file, which might not work in other scenarios.

data foo;

  infile "c:\temp\temp2.csv" dsd dlm="|" termstr="^" truncover firstobs=2;

  input @@;

_infile_=compress(_infile_,"0D0A"x);

  input Sales Inventory Returns Region ~ $10. Product ~ $20. Subsidiary ~ $20. Stores;

  format Sales -- Returns dollar12.;

  if missing(cats(of Region--Subsidiary)) then delete;

run;


Message was edited by: Scott Bass:  Added update that IGNOREDOSEOF did not solve this issue.

Super Contributor
Posts: 376

Re: Read CSV file with embedded CRLF

Ok, this worked...

data foo;

infile"c:\temp\temp2.csv"dsd dlm="|" truncover firstobs=2 termstr=CRLF;

  input Sales Inventory Returns Region ~ $10. Product ~ $20. Subsidiary ~ $20. Stores;

  format Sales -- Returns dollar12.;

  if missing(cats(of Region--Subsidiary)) then delete;

run;


But this failed:


data foo;

infile"c:\temp\temp2.csv"dsd dlm="|" truncover firstobs=2;

  input Sales Inventory Returns Region ~ $10. Product ~ $20. Subsidiary ~ $20. Stores;

  format Sales -- Returns dollar12.;

  if missing(cats(of Region--Subsidiary)) then delete;

run;


Again, the SAS doc:


TERMSTR=

specifies the end-of-line character for the file. Use this option to share files between the UNIX and Windows operating environments. If termstr is not specified, a single LF or a CRLF function as the end of line character. If termstr=CRLF, then CRLF functions as the EOL character. The following are under Windows:

CRLF

Carriage return line feed. Use TERMSTR=CRLF to read Windows or DOS formatted files. CRLF is the default.

LF

Line feed. Use TERMSTR=LF to read UNIX formatted files. If a file contains CRLF characters, the CR functions as part of the data and not an end of line character.

CR

Carriage Return. Use TERMSTR=CR if the end of line character is a CR.


So, if I run the code without specifying the default TERMSTR=CRLF, it fails.  If I explicitly specify the default TERMSTR=CRLF, it works.

Sigh...so I assert that whatever the Windows default is, it's not TERMSTR=CRLF.  Perhaps a little birdie will enter a doc bug.

BTW, this is SAS 9.3 on Windows 7.  Perhaps SAS 9.4 works differently...

PROC Star
Posts: 7,364

Re: Read CSV file with embedded CRLF

: I'm obviously not a little birdie, but you missed the following line from the documentation: "If termstr is not specified, a single LF or a CRLF function as the end of line character."

Valued Guide
Posts: 3,208

Re: Read CSV file with embedded CRLF

Scott, I do not know to wh you are referring. I was the one for csv using comma as a field separator or the comma as ad decimal point.

The CR/LF combination of Arthur is a good one as that one keeps the difference between those two.

Would you transfer the file to Unix using a ftp program in ascii-mode it will translate those CR/LF to just the LF. That eliminates the important difference.

Avoiding a ascii mode transfer and using a binary mode would have no negative side effects for that file.

Doing this for other type of files (eg sh scripts) you can cause unreadable files on the Unix system without being noticed.

There was a time (s)ftp did have encoding tables included. They are mostly gone with all Unicode type encodings.

It would be better for having this better documented. Perhaps some KB/note paper "101 windows/unix files"  

---->-- ja karman --<-----
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 5296 views
  • 7 likes
  • 9 in conversation