BookmarkSubscribeRSS Feed
ppaidami
SAS Employee

I have a large csv file that export from another data source, I want to convert to SAS dataset. The file has over 100 columns and some columns has multiple line value (people paste email information here). I tried both proc import and data step, it could only import the first line, the second line overwrite the next column or wrap to next row. Here is an example of description field:

first row: There are 5 Black Belt in Italy.

second row: They are using Excel

 

my code is:

data test;

infile "c:\test" missover dsd lrecl=32767 firstobs=2;

informat description $1000.;

format description $1000.;

input description $;

run;

 

Any ideas how to fix the code? 

 

Thank you very much!

9 REPLIES 9
ballardw
Super User

Easiest in the long run is likely to go to the data source that the file is created from and have that field remove any of the carriage return, linefeed or what ever character is used to create the multiple lines. This becomes more important if this process is to be repeated as this type of bad data will cause issues forever otherwise.

 

If this is a one time thing and there are not millions of lines then using a text editor such as notepad you can delete the end of line markers manually (they will always be at the end of line so not hard to find) and then read the file.

 

Your problem is that programs look can look for patterns but you have to be able to describe the pattern. Proc import doesn't allow much and even a datastep means you have to be able to figure out how and that is not going to be a trivial exercise.

 

Reeza
Super User

In a well defined CSV file, a field with multiple values would be wrapped in quotation marks? Is that the case here?

 

What happens if you open the file in Excel, how do those fields look?

Riteshdell
Quartz | Level 8

Hi Reeza - I am getting same kind of error.

I ahve a pipe symbol, and a middle column reocords is going continue till 3 last line.

so every line it is reading as quot " ".

 

Please tell me how to fix it,

ChrisNZ
Tourmaline | Level 20

Difficult to know from the data you posted, but it seems from your description that option truncover may help.

ballardw
Super User

I am attaching a very small example of what I think is the problem. I have run into these and the issue is the characters getting treated as end of line mean the MISSOVER or TRUNCOVER do not finish reading the first line, then the coninued text is attempting to be read into the incorrect variables. I changed the extension to TXT so that it uploads and is readable easily.

 

Very basic code to read that file:

data junk;
   infile "d:\data\examplecsv.txt" dlm=',' firstobs=2 missover;
   informat id  Var1 best4.
            description $200.
            var3 best4.
   ;
   input id var1 description var3;
run;

 

Lots of Invalid data for id messages with this kind of data.

The file was generated use Excel and Alt-Enter to put vertical spaces within a single cell.

Tom
Super User Tom
Super User

Your example file has  embedded CR+LF characters.  But when I make such a file from Excel it only embeds LF characters.

You should be able to read a file with embedded LF characters and end of lines marked by CR+LF if you add the TERMSTR=CRLF option to your infile statement.

 

If that doesn't work then you can use code like this to convert any CR or LF in the middle of a quoted string to a pipe character.  

filename copy temp;

data _null_;
  infile 'exampleCSV.txt' recfm=n ;
  file copy recfm=n;
  input ch $char1.;
  retain q 0;
  q = mod(q+(ch='"'),2);
  if q and ch in ('0D'x,'0A'x) then put '|';
  else put ch $char1. ;
run;

Then you should be able to read the file.

data _null_;
  infile copy dsd truncover;
  length x1-x5 $200;
  input x1-x5 ;
  put (_all_) (=);
run;


x1=Id x2=Var1 x3=Description x4=Var3 x5=
x1=1 x2=23 x3=A simple comment x4=17 x5=
x1=2 x2=18 x3=A comment entered|when using Alt-Enter in |Excel to create a multiline single entry x4=44 x5=
x1=3 x2=29 x3=Another simple comment x4=55 x5=
NOTE: 4 records were read from the infile COPY.
PGStats
Opal | Level 21

See if your situation is similar to my simulated example:

 

proc print data=test; run;

                            Obs            text

                             1     "This is one line
                             2     and another line
                             3     and a third line"
                             4     Some text in line two

data want;
length line $1000;
do until ((char(line,1) = '"') = (substr(line,length(line),1) = '"')) ;
    set test;
    line = catx(" ", line, text);
    end;
drop text;
run;

proc print data=want; run;

            Obs    line

             1     "This is one line and another line and a third line"
             2     Some text in line two
PG
ppaidami
SAS Employee

Thank you for all your suggestions. I have tried all the ideas. Since the data is not clean, none of them can completed resolve the issue, I decide to move the column to the last and remove from import if necessary. 

ChrisNZ
Tourmaline | Level 20

Do try @Tom 's TERMST=CRLF suggestion. It could well solve your issue.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 4690 views
  • 5 likes
  • 7 in conversation