05-18-2016 04:46 PM
I have a large CSV file exported from another data source that I want to convert to a SAS dataset. The file has over 100 columns, and some columns have multi-line values (people pasted email text into them). I tried both PROC IMPORT and a DATA step; each reads only the first line, and the continuation either overwrites the next column or wraps to the next row. Here is an example of the description field:
first row: There are 5 Black Belt in Italy.
second row: They are using Excel
my code is:
data test;
  infile "c:\test" missover dsd lrecl=32767 firstobs=2;
  informat description $1000.;
  format description $1000.;
  input description $;
run;
Any ideas how to fix the code?
Thank you very much!
05-18-2016 04:55 PM
The easiest fix in the long run is likely to go to the data source the file is created from and have that field stripped of the carriage return, line feed, or whatever character is creating the multiple lines. This matters even more if the process will be repeated, because this kind of bad data will cause issues forever otherwise.
If this is a one-time thing and there are not millions of lines, you can open the file in a text editor such as Notepad, delete the stray end-of-line markers manually (they will always be at the end of a line, so they are not hard to find), and then read the file.
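If a scripting language is handy, that manual cleanup can also be automated. As a minimal sketch (Python rather than SAS, and the sample data is made up for illustration): Python's csv module understands quoted multi-line fields, so the file can be rewritten with the embedded line breaks replaced by spaces before SAS ever sees it.

```python
import csv
import io

# Made-up sample: one quoted field contains an embedded newline.
raw = 'id,description\n1,"line one\nline two"\n2,simple\n'

# csv.reader keeps the embedded newline inside a single field,
# so we can safely replace it with a space and re-write the row.
cleaned = io.StringIO()
writer = csv.writer(cleaned, lineterminator='\n')
for row in csv.reader(io.StringIO(raw)):
    writer.writerow([f.replace('\r', ' ').replace('\n', ' ') for f in row])

print(cleaned.getvalue())
# id,description
# 1,line one line two
# 2,simple
```

For a real file you would read from and write to disk instead of io.StringIO, but the per-row logic is the same.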
Your problem is that programs can look for patterns, but you have to be able to describe the pattern. PROC IMPORT doesn't allow much, and even a DATA step means you have to work out the pattern yourself, which is not a trivial exercise.
05-18-2016 05:06 PM
In a well-formed CSV file, a field with multiple lines would be wrapped in quotation marks. Is that the case here?
What happens if you open the file in Excel? How do those fields look?
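For what it's worth, a conforming CSV parser treats a quoted multi-line value as one field. A quick illustration in Python (the sample text is invented, echoing the poster's example, not their actual file):

```python
import csv
import io

# A well-formed CSV: the multi-line value is wrapped in quotation marks.
text = 'id,description\n1,"There are 5 Black Belt in Italy.\nThey are using Excel"\n'

rows = list(csv.reader(io.StringIO(text)))
print(rows[1])
# ['1', 'There are 5 Black Belt in Italy.\nThey are using Excel']
```

If the quotes are present, the embedded newline stays inside the field; if they are missing, no parser can reliably tell a wrapped value from a new record.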
05-18-2016 06:33 PM
I am attaching a very small example of what I think is the problem. I have run into these before: the characters treated as end-of-line mean that MISSOVER or TRUNCOVER stop before finishing the first record, and the continued text is then read into the wrong variables. I changed the extension to TXT so that it uploads and is easily readable.
Very basic code to read that file:
data junk;
  infile "d:\data\examplecsv.txt" dlm=',' firstobs=2 missover;
  informat id var1 best4. description $200. var3 best4.;
  input id var1 description var3;
run;
Lots of "Invalid data for id" messages result from this kind of data.
The file was generated using Excel, with Alt-Enter used to put vertical spaces within a single cell.
05-24-2016 01:44 PM - edited 05-24-2016 01:54 PM
Your example file has embedded CR+LF characters. But when I make such a file from Excel it only embeds LF characters.
You should be able to read a file with embedded LF characters and end of lines marked by CR+LF if you add the TERMSTR=CRLF option to your infile statement.
If that doesn't work, you can use code like this to convert any CR or LF in the middle of a quoted string to a pipe character.
filename copy temp;
data _null_;
  infile 'exampleCSV.txt' recfm=n;
  file copy recfm=n;
  input ch $char1.;
  retain q 0;
  q = mod(q + (ch='"'), 2);
  if q and ch in ('0D'x,'0A'x) then put '|';
  else put ch $char1.;
run;
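For readers outside SAS, the same quote-parity idea can be sketched in Python (the sample bytes are made up; this mirrors, not replaces, the DATA step above):

```python
def fix_embedded_newlines(data: bytes) -> bytes:
    """Replace CR/LF bytes that occur inside a quoted field with '|'."""
    out = bytearray()
    inside_quotes = False
    for b in data:
        # Toggle state on every quote character (quote parity).
        if b == ord('"'):
            inside_quotes = not inside_quotes
        # Line-break bytes inside quotes become a pipe; all else passes through.
        if inside_quotes and b in (0x0D, 0x0A):
            out.append(ord('|'))
        else:
            out.append(b)
    return bytes(out)

sample = b'1,23,"A comment entered\r\nover two lines",17\r\n'
print(fix_embedded_newlines(sample))
# b'1,23,"A comment entered||over two lines",17\r\n'
```

Note that an embedded CR+LF pair becomes two pipes, exactly as in the SAS step, while the record-ending CR+LF outside the quotes is left alone.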
Then you should be able to read the file.
data _null_;
  infile copy dsd truncover;
  length x1-x5 $200;
  input x1-x5;
  put (_all_) (=);
run;

Log output:

x1=Id x2=Var1 x3=Description x4=Var3 x5=
x1=1 x2=23 x3=A simple comment x4=17 x5=
x1=2 x2=18 x3=A comment entered|when using Alt-Enter in |Excel to create a multiline single entry x4=44 x5=
x1=3 x2=29 x3=Another simple comment x4=55 x5=
NOTE: 4 records were read from the infile COPY.
05-18-2016 06:26 PM
See if your situation is similar to my simulated example:
proc print data=test; run;

Obs    text
  1    "This is one line
  2    and another line
  3    and a third line"
  4    Some text in line two

data want;
  length line $1000;
  do until ((char(line,1) = '"') = (substr(line,length(line),1) = '"'));
    set test;
    line = catx(" ", line, text);
  end;
  drop text;
run;

proc print data=want; run;

Obs    line
  1    "This is one line and another line and a third line"
  2    Some text in line two
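The accumulate-until-the-quotes-balance loop can be sketched outside SAS as well. A rough Python analogue using the same simulated lines (an illustration of the technique, not the original code):

```python
# Keep concatenating physical lines until the accumulated record
# either both starts and ends with a quote, or involves no quote at all.
lines = [
    '"This is one line',
    'and another line',
    'and a third line"',
    'Some text in line two',
]

records = []
buf = ''
for text in lines:
    buf = (buf + ' ' + text) if buf else text
    # Same test as the SAS DO UNTIL: leading-quote state equals
    # trailing-quote state means the record is complete.
    if buf.startswith('"') == buf.endswith('"'):
        records.append(buf)
        buf = ''

print(records)
# ['"This is one line and another line and a third line"',
#  'Some text in line two']
```

As with the SAS version, this assumes a quote never appears mid-field; a quote embedded inside the text would throw the parity test off.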
05-24-2016 01:10 PM
Thank you for all your suggestions. I have tried all the ideas, but since the data is not clean, none of them could completely resolve the issue. I decided to move the column to the end and drop it from the import if necessary.