Solved: Importing Pipe delimited file with Carriage Returns in String

gaurav_248 · Posted 03-15-2016 02:58 PM

I need to import a pipe delimited file in which the character columns are within quotes (" "). However some of these columns contain a Carriage Return (probably when enter pressed while filling in the data in same cell). Hence the data is looking something like:

Var1 | "Var2" | Var3

1 | "Hi. This contains a break

here" | 105

2 | "Another break

here and

here" | 109

I am using SAS 9.2. Can someone help me out in order to resolve the issue and import it into SAS as :

Var1 | "Var2" | Var3

1 | "Hi. This contains a break here" | 105

2 | "Another break here and here" | 109

Thanks.

Gaurav

FreelanceReinh · Posted 03-15-2016 07:57 PM

Thanks for suggesting your own solution. I see two possibly minor drawbacks:

According to your first post you wanted to replace line breaks with single blanks. With the suggested approach, however, the lines (within a quoted string) would be concatenated without inserting blanks.
Special characters from the range '80'x - 'FF'x (e.g. "µ") would be removed.

Alternatively, you could first find out what type of line break really occurs within the quoted strings. Just open the text file with a hex editor or read (relevant parts of) it with SAS using INFILE with RECFM=F and display the contents in $HEXw. format.

Example:

data _null_;
length string $32;
infile cdata recfm=f lrecl=32 truncover obs=10;
input string $char32.;
put string $hex64.;
run;

You wrote that @data_null__'s promising suggestion "didn't work out." This would be the case if the line breaks in your quoted strings were CRLF ('0D0A'x) characters.

In this case you could use the little program from http://support.sas.com/techsup/technote/ts673.pdf, p. 6 (which is similar to yours, but targeted at LF characters) to overwrite the LFs ('0A'x) within quoted strings with blanks. (To avoid overwriting the original text file, you could, of course, use file outdata ... as in your approach.)

You could then read the modified file as usual, remove the remaining CR ('0D'x) characters and compress multiple blanks (from multiple line breaks as in your sample data) to single blanks (unless there are other multiple blanks which are worth preserving):

data test;
length ID_IYXZ ID_LMNO 8 TX_ABCD $256; /* please adapt length as appropriate */
infile outdata dlm='|' dsd firstobs=2;
input ID_IYXZ ID_LMNO TX_ABCD :$quote.;
TX_ABCD=compbl(compress(TX_ABCD, '0D'x));
run;

View solution in original post

data_null__ · Posted 03-15-2016 03:02 PM

It is probably linefeed 0Ax. Try using INFILE statement option TERMSTR=CRLF. Assuming your records are DOS based.

gaurav_248 · Posted 03-15-2016 03:08 PM

Thanks for replying. I already tried the option but it didn't work out.

Is there any other alternative available?

Reeza · Posted 03-15-2016 03:17 PM

Please post a sample of your file.

gaurav_248 · Posted 03-15-2016 03:38 PM

I have edited the text of the data in order to remove confidential information. But the structure is the same.
____________________________________________________________
ID_IYXZ|ID_LMNO|TX_ABCD|
1|343|"O QUE YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYL?"|
1|345|"O ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ?"|
1|347|"PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP?"|
1|344|"SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS?

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ."|
___________________________________________________________________________

gaurav_248 · Posted 03-15-2016 05:35 PM

I found this as a working solution.

Working on the raw csv in binary format and removing special characters, then simply proc import the new csv.

data _null_;
infile cdata lrecl=1 recfm=n end=eof;
file outdata lrecl=1000000;
nq=0;
do until (eof or (char='0A'X and mod(nq,2)=0));
input char $char1.;
nq=nq+(char='"') ;
if char not in ('0D'x,'0A'x) then do;
char1 = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, char);
if missing(char)=1 or missing(char1)=0 then put char1 $char1. @ ;
end;
end;
put ;
run;

Can someone confirm this and please explain any drawbacks/problems this might cause?

FreelanceReinh · Posted 03-15-2016 07:57 PM

Thanks for suggesting your own solution. I see two possibly minor drawbacks:

According to your first post you wanted to replace line breaks with single blanks. With the suggested approach, however, the lines (within a quoted string) would be concatenated without inserting blanks.
Special characters from the range '80'x - 'FF'x (e.g. "µ") would be removed.

Alternatively, you could first find out what type of line break really occurs within the quoted strings. Just open the text file with a hex editor or read (relevant parts of) it with SAS using INFILE with RECFM=F and display the contents in $HEXw. format.

Example:

data _null_;
length string $32;
infile cdata recfm=f lrecl=32 truncover obs=10;
input string $char32.;
put string $hex64.;
run;

You wrote that @data_null__'s promising suggestion "didn't work out." This would be the case if the line breaks in your quoted strings were CRLF ('0D0A'x) characters.

In this case you could use the little program from http://support.sas.com/techsup/technote/ts673.pdf, p. 6 (which is similar to yours, but targeted at LF characters) to overwrite the LFs ('0A'x) within quoted strings with blanks. (To avoid overwriting the original text file, you could, of course, use file outdata ... as in your approach.)

You could then read the modified file as usual, remove the remaining CR ('0D'x) characters and compress multiple blanks (from multiple line breaks as in your sample data) to single blanks (unless there are other multiple blanks which are worth preserving):

data test;
length ID_IYXZ ID_LMNO 8 TX_ABCD $256; /* please adapt length as appropriate */
infile outdata dlm='|' dsd firstobs=2;
input ID_IYXZ ID_LMNO TX_ABCD :$quote.;
TX_ABCD=compbl(compress(TX_ABCD, '0D'x));
run;

gaurav_248 · Posted 03-16-2016 08:54 AM

Thank you so much. You were right about the drawbacks.
We are working with Portuguese text and some characters with accents were getting removed. But a small tweak in the characters we need removed solved that issue.

Ksharp · Posted 03-15-2016 09:32 PM

This could give you a start . You need change the code according to your real data. Especiall which one is the start of a row .

data have;
infile '/folders/myfolders/x.txt' dsd delimiter='|'  recfm=n ;
input (x) (: $40.) @@ ;
x1=scan(x,1,'0D0A'x);
x2=scan(x,2,'0D0A'x);
if notdigit(strip(x2))=0 and not missing(x2) then do;
 x=x1;output;
 group+1;x=x2;output;
end;
else output;
drop x1 x2;
run;

proc transpose data=have out=want(drop=_:);
by group;
var x;
run;

Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

Re: Importing Pipe delimited file with Carriage Returns in String

SAS Innovate 2025: Save the Date