BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gaurav_248
Calcite | Level 5

I need to import a pipe delimited file in which the character columns are within quotes (" "). However some of these columns contain a Carriage Return (probably when enter pressed while filling in the data in same cell). Hence the data is looking something like: 

 

Var1 | "Var2" | Var3

1 | "Hi. This contains a break

here" | 105

2 | "Another break

here and

here" | 109

 

I am using SAS 9.2. Can someone help me out in order to resolve the issue and import it into SAS as :

 

Var1 | "Var2" | Var3

1 | "Hi. This contains a break here" | 105

2 | "Another break here and here" | 109

 

Thanks.

Gaurav

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Thanks for suggesting your own solution. I see two possibly minor drawbacks:

  1. According to your first post you wanted to replace line breaks with single blanks. With the suggested approach, however, the lines (within a quoted string) would be concatenated without inserting blanks.
  2. Special characters from the range '80'x - 'FF'x (e.g. "µ") would be removed.

 

Alternatively, you could first find out what type of line break really occurs within the quoted strings. Just open the text file with a hex editor or read (relevant parts of) it with SAS using INFILE with RECFM=F and display the contents in $HEXw. format.

 

Example: 

data _null_;
length string $32;
infile cdata recfm=f lrecl=32 truncover obs=10;
input string $char32.;
put string $hex64.;
run;

 

You wrote that @data_null__'s promising suggestion "didn't work out." This would be the case if the line breaks in your quoted strings were CRLF ('0D0A'x) characters.

 

In this case you could use the little program from http://support.sas.com/techsup/technote/ts673.pdf, p. 6 (which is similar to yours, but targeted at LF characters) to overwrite the LFs ('0A'x) within quoted strings with blanks. (To avoid overwriting the original text file, you could, of course, use file outdata ... as in your approach.)

 

You could then read the modified file as usual, remove the remaining CR ('0D'x) characters and compress multiple blanks (from multiple line breaks as in your sample data) to single blanks (unless there are other multiple blanks which are worth preserving):

 

data test;
length ID_IYXZ ID_LMNO 8 TX_ABCD $256; /* please adapt length as appropriate */
infile outdata dlm='|' dsd firstobs=2;
input ID_IYXZ ID_LMNO TX_ABCD :$quote.;
TX_ABCD=compbl(compress(TX_ABCD, '0D'x));
run;

 

 

 

View solution in original post

8 REPLIES 8
data_null__
Jade | Level 19

It is probably linefeed 0Ax.  Try using INFILE statement option TERMSTR=CRLF.   Assuming your records are DOS based.  

gaurav_248
Calcite | Level 5

 

Thanks for replying. I already tried the option but it didn't work out.

Is there any other alternative available?

Reeza
Super User

Please post a sample of your file.

 

 

gaurav_248
Calcite | Level 5
I have edited the text of the data in order to remove confidential information. But the structure is the same.
____________________________________________________________
ID_IYXZ|ID_LMNO|TX_ABCD|
1|343|"O QUE YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYL?"|
1|345|"O ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ?"|
1|347|"PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP?"|
1|344|"SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS?

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ."|
___________________________________________________________________________
gaurav_248
Calcite | Level 5

I found this as a working solution.

Working on the raw csv in binary format and removing special characters, then simply proc import the new csv.

 

 

data _null_;
infile cdata lrecl=1 recfm=n end=eof;
file outdata lrecl=1000000;
nq=0;
do until (eof or (char='0A'X and mod(nq,2)=0));
input char $char1.;
nq=nq+(char='"') ;
if char not in ('0D'x,'0A'x) then do;
char1 = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, char);
if missing(char)=1 or missing(char1)=0 then put char1 $char1. @ ;
end;
end;
put ;
run;

 

Can someone confirm this and please explain any drawbacks/problems this might cause?

FreelanceReinh
Jade | Level 19

Thanks for suggesting your own solution. I see two possibly minor drawbacks:

  1. According to your first post you wanted to replace line breaks with single blanks. With the suggested approach, however, the lines (within a quoted string) would be concatenated without inserting blanks.
  2. Special characters from the range '80'x - 'FF'x (e.g. "µ") would be removed.

 

Alternatively, you could first find out what type of line break really occurs within the quoted strings. Just open the text file with a hex editor or read (relevant parts of) it with SAS using INFILE with RECFM=F and display the contents in $HEXw. format.

 

Example: 

data _null_;
length string $32;
infile cdata recfm=f lrecl=32 truncover obs=10;
input string $char32.;
put string $hex64.;
run;

 

You wrote that @data_null__'s promising suggestion "didn't work out." This would be the case if the line breaks in your quoted strings were CRLF ('0D0A'x) characters.

 

In this case you could use the little program from http://support.sas.com/techsup/technote/ts673.pdf, p. 6 (which is similar to yours, but targeted at LF characters) to overwrite the LFs ('0A'x) within quoted strings with blanks. (To avoid overwriting the original text file, you could, of course, use file outdata ... as in your approach.)

 

You could then read the modified file as usual, remove the remaining CR ('0D'x) characters and compress multiple blanks (from multiple line breaks as in your sample data) to single blanks (unless there are other multiple blanks which are worth preserving):

 

data test;
length ID_IYXZ ID_LMNO 8 TX_ABCD $256; /* please adapt length as appropriate */
infile outdata dlm='|' dsd firstobs=2;
input ID_IYXZ ID_LMNO TX_ABCD :$quote.;
TX_ABCD=compbl(compress(TX_ABCD, '0D'x));
run;

 

 

 

gaurav_248
Calcite | Level 5
Thank you so much. You were right about the drawbacks.
We are working with Portuguese text and some characters with accents were getting removed. But a small tweak in the characters we need removed solved that issue.
Ksharp
Super User

This could give you a start . You need change the code according to your real data. Especiall which one is the start of a row .

 

data have;
infile '/folders/myfolders/x.txt' dsd delimiter='|'  recfm=n ;
input (x) (: $40.) @@ ;
x1=scan(x,1,'0D0A'x);
x2=scan(x,2,'0D0A'x);
if notdigit(strip(x2))=0 and not missing(x2) then do;
 x=x1;output;
 group+1;x=x2;output;
end;
else output;
drop x1 x2;
run;

proc transpose data=have out=want(drop=_:);
by group;
var x;
run;

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 4470 views
  • 3 likes
  • 5 in conversation