BookmarkSubscribeRSS Feed
Lost_Gary
Quartz | Level 8

I have a CSV file that is kind of messy as it is a system extract file where at least 1 of the 800+ data fields contains potential commas from a comment field.  If I import this using the EG import wizard it seems to handle these 'extra' commas without any issue (i.e. comment field contains the comment detail with or without commas).  If I use proc import or the infile, I receive a number of rows that chop the data at the various commas from this comment field.  This happens even when i copy the infile code from the wizard and try to run it on my own.  

 

Are there any commands or tricks to get my imports to behave as they do within the import wizard?  

12 REPLIES 12
ballardw
Super User

So your "infile" code and example data.

Typically the INFILE option DSD would be used if the commas appear in a proper CSV format with them inside quote:

 

,"This, even though it contains , would be read as a single value with the DSD option",nextvalue

 

If the file does not have quotes when it is contains the delimiter character inside values then it really isn't a proper CSV file.

 

Since proc import generally defaults to using DSD one actually suspects that your problem may not be commas as such but that part of the comment contains a character that might be treated as end of line that EG is ignoring. But have to see an actual example. Or look at your file in a program like Notepad+ that shows non-printable characters.

Tom
Super User Tom
Super User

You cannot just copy the code written by the EG wizard, because that code does NOT run against the original file.  Instead the EG wizard creates a cleaned up verison of the file and uploads that to the SAS server, so that is the file that the code written by EG is designed to read.

 

If PROC IMPORT with GUESSINGROWS=MAX does not read the file properly and EG can then you probably have embedded end of line markers in the value of one or more fields.

 

If you are lucky the embedded end of line markers is just bare LF (or perhaps bare CR) characters, but the ends of the actual data lines are marked with CR+LF pair.  This is the normal end of line marker for a text file on Windows.  To solve that you just need to use the TERMSTR=CRLF option.

filename csv 'myfile.csv' termstr=crlf ;
proc import dbms=csv file=csv ....

If instead you have file where the extra end of line markers are exactly the same as the real end of line markers then probably the only way that EG could read the file properly would be if such values are enclosed in quotes. Otherwise the file cannot really be parsed unambiguously into fields without some additional knowledge about where the extra linebreaks of commas could occur.

 

You can take advantage of the quotes to make a fixed version of the file that has any end of line markers inside of quotes replaced with something else.  See this macro:  %replace_crlf()   The resulting fixed file should be readable by PROC IMPORT.

 

If neither of those fixes work then you will need to share a copy of the file (or a smaller one that exhibits the same symptoms) to get a better plan of attack.

ChrisHemedinger
Community Manager

You can harvest and reuse much of the code that EG generates by following the steps in this article. Some small modifications required from you, but you can use the EG wizard to generate quite a bit of it.

Check out SAS Innovate on-demand content! Watch the main stage sessions, keynotes, and over 20 technical breakout sessions!
Lost_Gary
Quartz | Level 8
I cannot seem to get that macro to work.
ballardw
Super User

@Lost_Gary wrote:
I cannot seem to get that macro to work.

Awful vague.

Are there errors in the log?: Post the code and log in a code box opened with the "</>" to maintain formatting of error messages.

No output? Post any log in a code box.

If the you have problems with a macro set OPTIONS MPRINT; before executing the macro. The LOG will show more details. If you need help still, copy that log and post the text on the forum.

Tom
Super User Tom
Super User

It what way did it not work?

 

First you need to compile the macro:

filename github url 'https://raw.githubusercontent.com/sasutils/macros/master/replace_crlf.sas';
%include github ;

Then you just need to call it with the filerefs (or quoted physical names) of the input and output files.

filename bad 'existing file.csv';
filename good 'fixed file.csv';
%replace_crlf(bad,good)

Then you can trying reading the fixed file to see if PROC IMPORT can now make a guess at how to read it.

proc import file=good out=want replace dbms=csv;
   guessingrows=max;
run;
Lost_Gary
Quartz | Level 8

I've worked with the macro and I finally got it to run, but I had to abort the process as it was turning my file from 200k records to 600m.  I don't think it is end of file indicators that are giving me issues with my data.  It is commas that I know are in comment fields within the data.  How do i input a text qualifier into this type of import.  As the EG import wizard can do, I have to believe there is a way to it within the infile statement of proc import.  

ballardw
Super User

Use a text editor to copy something like the first 5 or 10 lines from the CSV file. On the forum open a text box an paste those lines.

 

And are you sure that this file has never been opened and saved by a spreadsheet program?

Lost_Gary
Quartz | Level 8

I've got it figured out - I am going to make an ODBC Connection to the data.  🤔  Sometimes tough to see the forest from the trees.  

 

Tom
Super User Tom
Super User

@Lost_Gary wrote:

I've got it figured out - I am going to make an ODBC Connection to the data.  🤔  Sometimes tough to see the forest from the trees.  

 


So you are not trying to read in a CSV file? What type of file were you trying to read with PROC IMPORT?

 

If the data is actually in a database that you can connect to with ODBC or any other type of connection that you can copy the data without making any temporary file.

Tom
Super User Tom
Super User

You need to actually LOOK at the file and see if it is a valid CSV file or not.

 

What is the delimiter? Is it actually a comma? Or is it a semicolon? Or a TAB? Or something else?
What character(s) does the file use to mark the end of the lines?  Is it CR+LF like a WINDOWS file? Is in only LF like a UNIX file?  Is it only CR like the original Macintosh used?

 

Do any of the values contain the delimiter?  Are those values enclosed in quotes?

Do any of the values contain quote character? Are those values enclosed in quotes with the actual quotes doubled up?

Do any of the values contain the end of line characters? Are those values enclosed in quotes?

 

Here is a simple way to LOOK at the file if you cannot find a text editor you like.  

Use a data step with the LIST statement.  Use the OBS= option on the INFILE statement to limit the number of lines you read.  You can also use FIRSTOBS= and OBS= together to read some lines form the later in the file.

data _null_;
  infile 'myfile.csv' obs=10;
  input;
  list;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 906 views
  • 0 likes
  • 4 in conversation