Text mining and content categorization

Importing Pipe delimited file with Carriage Returns in String

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

Importing Pipe delimited file with Carriage Returns in String

I need to import a pipe delimited file in which the character columns are within quotes (" "). However some of these columns contain a Carriage Return (probably when enter pressed while filling in the data in same cell). Hence the data is looking something like: 

 

Var1 | "Var2" | Var3

1 | "Hi. This contains a break

here" | 105

2 | "Another break

here and

here" | 109

 

I am using SAS 9.2. Can someone help me out in order to resolve the issue and import it into SAS as :

 

Var1 | "Var2" | Var3

1 | "Hi. This contains a break here" | 105

2 | "Another break here and here" | 109

 

Thanks.

Gaurav


Accepted Solutions
Solution
‎03-16-2016 08:52 AM
Trusted Advisor
Posts: 1,114

Re: Importing Pipe delimited file with Carriage Returns in String

Thanks for suggesting your own solution. I see two possibly minor drawbacks:

  1. According to your first post you wanted to replace line breaks with single blanks. With the suggested approach, however, the lines (within a quoted string) would be concatenated without inserting blanks.
  2. Special characters from the range '80'x - 'FF'x (e.g. "µ") would be removed.

 

Alternatively, you could first find out what type of line break really occurs within the quoted strings. Just open the text file with a hex editor or read (relevant parts of) it with SAS using INFILE with RECFM=F and display the contents in $HEXw. format.

 

Example: 

data _null_;
length string $32;
infile cdata recfm=f lrecl=32 truncover obs=10;
input string $char32.;
put string $hex64.;
run;

 

You wrote that @data_null__'s promising suggestion "didn't work out." This would be the case if the line breaks in your quoted strings were CRLF ('0D0A'x) characters.

 

In this case you could use the little program from http://support.sas.com/techsup/technote/ts673.pdf, p. 6 (which is similar to yours, but targeted at LF characters) to overwrite the LFs ('0A'x) within quoted strings with blanks. (To avoid overwriting the original text file, you could, of course, use file outdata ... as in your approach.)

 

You could then read the modified file as usual, remove the remaining CR ('0D'x) characters and compress multiple blanks (from multiple line breaks as in your sample data) to single blanks (unless there are other multiple blanks which are worth preserving):

 

data test;
length ID_IYXZ ID_LMNO 8 TX_ABCD $256; /* please adapt length as appropriate */
infile outdata dlm='|' dsd firstobs=2;
input ID_IYXZ ID_LMNO TX_ABCD :$quote.;
TX_ABCD=compbl(compress(TX_ABCD, '0D'x));
run;

 

 

 

View solution in original post


All Replies
Respected Advisor
Posts: 3,773

Re: Importing Pipe delimited file with Carriage Returns in String

It is probably linefeed 0Ax.  Try using INFILE statement option TERMSTR=CRLF.   Assuming your records are DOS based.  

Occasional Contributor
Posts: 9

Re: Importing Pipe delimited file with Carriage Returns in String

 

Thanks for replying. I already tried the option but it didn't work out.

Is there any other alternative available?

Grand Advisor
Posts: 16,916

Re: Importing Pipe delimited file with Carriage Returns in String

Please post a sample of your file.

 

 

Occasional Contributor
Posts: 9

Re: Importing Pipe delimited file with Carriage Returns in String

I have edited the text of the data in order to remove confidential information. But the structure is the same.
____________________________________________________________
ID_IYXZ|ID_LMNO|TX_ABCD|
1|343|"O QUE YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYL?"|
1|345|"O ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ?"|
1|347|"PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP?"|
1|344|"SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS?

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ."|
___________________________________________________________________________
Occasional Contributor
Posts: 9

Re: Importing Pipe delimited file with Carriage Returns in String

[ Edited ]

I found this as a working solution.

Working on the raw csv in binary format and removing special characters, then simply proc import the new csv.

 

 

data _null_;
infile cdata lrecl=1 recfm=n end=eof;
file outdata lrecl=1000000;
nq=0;
do until (eof or (char='0A'X and mod(nq,2)=0));
input char $char1.;
nq=nq+(char='"') ;
if char not in ('0D'x,'0A'x) then do;
char1 = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, char);
if missing(char)=1 or missing(char1)=0 then put char1 $char1. @ ;
end;
end;
put ;
run;

 

Can someone confirm this and please explain any drawbacks/problems this might cause?

Solution
‎03-16-2016 08:52 AM
Trusted Advisor
Posts: 1,114

Re: Importing Pipe delimited file with Carriage Returns in String

Thanks for suggesting your own solution. I see two possibly minor drawbacks:

  1. According to your first post you wanted to replace line breaks with single blanks. With the suggested approach, however, the lines (within a quoted string) would be concatenated without inserting blanks.
  2. Special characters from the range '80'x - 'FF'x (e.g. "µ") would be removed.

 

Alternatively, you could first find out what type of line break really occurs within the quoted strings. Just open the text file with a hex editor or read (relevant parts of) it with SAS using INFILE with RECFM=F and display the contents in $HEXw. format.

 

Example: 

data _null_;
length string $32;
infile cdata recfm=f lrecl=32 truncover obs=10;
input string $char32.;
put string $hex64.;
run;

 

You wrote that @data_null__'s promising suggestion "didn't work out." This would be the case if the line breaks in your quoted strings were CRLF ('0D0A'x) characters.

 

In this case you could use the little program from http://support.sas.com/techsup/technote/ts673.pdf, p. 6 (which is similar to yours, but targeted at LF characters) to overwrite the LFs ('0A'x) within quoted strings with blanks. (To avoid overwriting the original text file, you could, of course, use file outdata ... as in your approach.)

 

You could then read the modified file as usual, remove the remaining CR ('0D'x) characters and compress multiple blanks (from multiple line breaks as in your sample data) to single blanks (unless there are other multiple blanks which are worth preserving):

 

data test;
length ID_IYXZ ID_LMNO 8 TX_ABCD $256; /* please adapt length as appropriate */
infile outdata dlm='|' dsd firstobs=2;
input ID_IYXZ ID_LMNO TX_ABCD :$quote.;
TX_ABCD=compbl(compress(TX_ABCD, '0D'x));
run;

 

 

 

Occasional Contributor
Posts: 9

Re: Importing Pipe delimited file with Carriage Returns in String

Thank you so much. You were right about the drawbacks.
We are working with Portuguese text and some characters with accents were getting removed. But a small tweak in the characters we need removed solved that issue.
Grand Advisor
Posts: 9,463

Re: Importing Pipe delimited file with Carriage Returns in String

This could give you a start . You need change the code according to your real data. Especiall which one is the start of a row .

 

data have;
infile '/folders/myfolders/x.txt' dsd delimiter='|'  recfm=n ;
input (x) (: $40.) @@ ;
x1=scan(x,1,'0D0A'x);
x2=scan(x,2,'0D0A'x);
if notdigit(strip(x2))=0 and not missing(x2) then do;
 x=x1;output;
 group+1;x=x2;output;
end;
else output;
drop x1 x2;
run;

proc transpose data=have out=want(drop=_:);
by group;
var x;
run;

 

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 904 views
  • 3 likes
  • 5 in conversation