BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
n6
Quartz | Level 8 n6
Quartz | Level 8

Someone is creating a CSV for me and then I deal with the CSV trying to bring it into SAS.

 

There are some text fields that contain commas in the CSV, which SAS doesn't like, understandably, since the CSVs use commas at delimiters.  Is the best way to deal with this to do CTRL-F in Excel to find all the commas and get rid of them?  Or should I insist that the person creating the CSV for me get rid of the comma during that step?  I don't know how easy or hard or standard or non-standard that is.

 

And I have another issue.  I'm not sure if this is caused by " or by a carriage return in Excel.  Sometimes when I bring data in I don't get an error at that step but then I look at the data and there are some obvious errors.  Sometimes something at the end of a row gets wrapped into the beginning of the next row and it may be that a carriage reutrn in Excel is the problem.  If the text field is like this

 

blah blah blah blah blah

 

then it's okay.  But sometimes in the Excel cell you see only

 

blah blah blah

 

and then you make the row wider (in the up and down sense) and you see the data is actually

 

blah blah blah

blah blah

 

IOW, it's continue below not because it's too long for the cell (because the data in the other cells happily overflows into the next cell) bur rather because there is a carriage return within that cell.  I remove the carriage return and then it's all in one cell.

 

Again, I don't know if the best way to deal with that is try to detect them and then manually remove the carriage returns or instead tell the person creating the CSV for me to fix that before it gets to me. 

 

So that's about it, although I do get some quotes in my data, although they're only in one cell in the Excel and there appear to be more of them when I bring it into SAS.  But perhaps a CTRL-F and deleting the quotes manually like with the comma would fix that.  I don't think the commas or the quotes matter in terms of data content so from that perspective they could be removed without consequence.

 

That's it.  TIA for any thoughts.

 

1 ACCEPTED SOLUTION

Accepted Solutions
n6
Quartz | Level 8 n6
Quartz | Level 8

 

Someone here has enlightened on this.  I was uninformed about some aspects of CSVs.  I thought that since when you opened a CSV in Excel you didn't see quotes that meant that there were no quotes in CSVs.  He told me, just as you did, that text fields in CSVs are inside quotes and it's just that Excel doesn't show them that made me think they weren't there.

 

I learned, with his help, that the ultimate source of all my problems was a carriage return inside those quotes.  You can have a comma inside quotes and SAS will ignore that when bringing in data, but if you have a carriage return or a line feed inside quotes then SAS thinks the field is done with.

 

My friend here pointed me to a macro that he said would solve the problem and it seems to have solved it.  So that others can benefit I'm including the code below, although I don't understand it.  Thanks for your assistance.  If I can figure out how to do it I'm going to mark this post as "Question Answered" so it turns green and people in the future will find this post and thus the code below.


*************************************;
%MACRO RemoveQuotedCRLF( filename = ) ;
/*--------------------------------------------------------
remove embedded CR/LFs in CSV files containing open text/comment fields
parameters:
filename this is complete filename of CSV file to be fixed

Note: CSV file will be overwritten with fixed file.
It can then import into SAS
--------------------------------------------------------*/;

%if not %sysfunc( fileexist( "&filename." ) ) %then %do ;
%put %str(E)RROR: The file: &filename was not found. ;
%return ;
%end ;

data _null_ ;
infile "&filename." recfm = n sharebuffers ;
file "&filename." recfm = n ;

retain opendoublequote 0 ;
input byte $char1. ;

if byte = '"' then opendoublequote = ^( opendoublequote ) ;
if opendoublequote then do ;
if byte = '0D'x then put " " ; /* CR */
else if byte = '0A'x then put " " ; /* LF */
end ;
run ;
%MEND;

View solution in original post

5 REPLIES 5
ballardw
Super User

If it is a proper CSV file then the fields with embedded columns should have the value included in quotes. That is DSD option in a data step and generally used in code generated by Proc Import to read CSV files.

 

HOW are you currently attempting to read the data? Proc Import, LIbname statement, data step, wizard? and which version of SAS are you using.

 

Is the file created in Excel or does some one edit a CSV file created by a different application it to create the embedded line break?

n6
Quartz | Level 8 n6
Quartz | Level 8

Here is the code that creates it.  This was created by having SAS create code for me in the past and then I copied and pasted it and use the same template each time.  The code below is in a macro but you get the idea.


%MACRO BID (dset);
PROC IMPORT OUT= WORK.&DSET
DATAFILE= "Path\&DSET..csv"
DBMS=CSV REPLACE; GETNAMES=YES; DATAROW=2; GUESSINGROWS=100;
RUN;
%MEND BID;

 

Actually, I had never used GUESSINGROWS before and I added it myself to solve a problem of some character variables further down in the Excel sheet having only 1 character once they're in SAS and GUESSINGROWS corrected that problem but that's unrelated to my current problem.

 

The data is on Oracle to begin with.  Someone I work with writes a PHP script to convert it to a CSV and then I use the code above to bring the CSV into SAS.

 

Commas in the data aren't in quotes.  Someone enters the data in Oracle and the text fields sometimes have commas and then when the person that convert it into CSV via a PHP script does their thing, the commas are still there.  Maybe there is a better way to do it in PHP so that by the time it gets to a CSV the commas are already gone.

ballardw
Super User

@n6 wrote:

Here is the code that creates it.  This was created by having SAS create code for me in the past and then I copied and pasted it and use the same template each time.  The code below is in a macro but you get the idea.


%MACRO BID (dset);
PROC IMPORT OUT= WORK.&DSET
DATAFILE= "Path\&DSET..csv"
DBMS=CSV REPLACE; GETNAMES=YES; DATAROW=2; GUESSINGROWS=100;
RUN;
%MEND BID;

 

Actually, I had never used GUESSINGROWS before and I added it myself to solve a problem of some character variables further down in the Excel sheet having only 1 character once they're in SAS and GUESSINGROWS corrected that problem but that's unrelated to my current problem.

 

The data is on Oracle to begin with.  Someone I work with writes a PHP script to convert it to a CSV and then I use the code above to bring the CSV into SAS.

 

Commas in the data aren't in quotes.  Someone enters the data in Oracle and the text fields sometimes have commas and then when the person that convert it into CSV via a PHP script does their thing, the commas are still there.  Maybe there is a better way to do it in PHP so that by the time it gets to a CSV the commas are already gone.


Since the CSV is created by some one using PHP that script should quote the values when a comma is embedded, or for ease, quote all character values. That would also likely be the easiest place to strip any line feed or carriage return characters imbedded in a text field.

 

As an experiment you might start with Excel and enter some data into cells with commas and then export the file to CSV using Excel's file>save as menu. Then open the CSV file with a text editor such as NotePad or even the SAS editor.

You should get something like:

"This is a field with a, in the middle",This has no comma

note the quotes. That is common for a CSV file. If the program creates data where the fields that contain commas then something has to be done so you know where a field actually starts and ends. Quotes are the most common.

 

SAS will not have problem if all of the character values are quoted unless you do something to force including the quotes in the values.

n6
Quartz | Level 8 n6
Quartz | Level 8

 

Someone here has enlightened on this.  I was uninformed about some aspects of CSVs.  I thought that since when you opened a CSV in Excel you didn't see quotes that meant that there were no quotes in CSVs.  He told me, just as you did, that text fields in CSVs are inside quotes and it's just that Excel doesn't show them that made me think they weren't there.

 

I learned, with his help, that the ultimate source of all my problems was a carriage return inside those quotes.  You can have a comma inside quotes and SAS will ignore that when bringing in data, but if you have a carriage return or a line feed inside quotes then SAS thinks the field is done with.

 

My friend here pointed me to a macro that he said would solve the problem and it seems to have solved it.  So that others can benefit I'm including the code below, although I don't understand it.  Thanks for your assistance.  If I can figure out how to do it I'm going to mark this post as "Question Answered" so it turns green and people in the future will find this post and thus the code below.


*************************************;
%MACRO RemoveQuotedCRLF( filename = ) ;
/*--------------------------------------------------------
remove embedded CR/LFs in CSV files containing open text/comment fields
parameters:
filename this is complete filename of CSV file to be fixed

Note: CSV file will be overwritten with fixed file.
It can then import into SAS
--------------------------------------------------------*/;

%if not %sysfunc( fileexist( "&filename." ) ) %then %do ;
%put %str(E)RROR: The file: &filename was not found. ;
%return ;
%end ;

data _null_ ;
infile "&filename." recfm = n sharebuffers ;
file "&filename." recfm = n ;

retain opendoublequote 0 ;
input byte $char1. ;

if byte = '"' then opendoublequote = ^( opendoublequote ) ;
if opendoublequote then do ;
if byte = '0D'x then put " " ; /* CR */
else if byte = '0A'x then put " " ; /* LF */
end ;
run ;
%MEND;

AngusLooney
SAS Employee

The trouble, at times, with "CSV" is that it isn't really a strong standard. And Excel's handling and creation of them can be positive and negative.

 

But the issue it sounds like you're facing is the infamous "carriage returns in cells/fields" issue. For example, the user is typing in a cell, and hits alt-return, treating the cell almost as a cell in a Word table. This can equally happen when user paste text into a free text field in a database front end.

 

In essence, when the row of data or row in Excel is written to a CSV, you can end up with something like

 

Colc_1,Col_2,Col_3

1,123,"Some text"

2,456,"Sentence followed by a return.

 Then another sentence."

3,678,"Back to well behaved"

 

Read these can be very tricky, especially if then then throw in having commas in the quoted text.

 

My fix for this, was to write a CSV remediation process, when if you told it the number of columns in the CSV, it would parse the file, line by line, and "fix" issues like the one on record 2 above, converting it to.

 

Colc_1,Col_2,Col_3

1,123,"Some text"

2,456,"Sentence followed by a return.<cr> Then another sentence."

3,678,"Back to well behaved"

 

It allowed you to substitute whatever you wanted for errant carriage returns inside quote text fields.

 

You do have to make certain assumptions.

1) the CSV is regular/rectangular - a fixed number of columns, with no deviation

2) text is "properly" quoted

 

We applied this to CSV files BEFORE attempting to ingest them using normal techniques, it wasn't a process that did the fixing as part of ingesting them. Our ingest process tracked and reconciled the CSV files, so we had "lineage/audit" on the fact that we'd altered the input files.

 

One further technique that's useful, you can use a combination of the Unix/Linux (or suitable capable Windows ports) commands grep, wc and uniq to detect any CSV files that are "broken" in this way, essentially you can test for each line only having the expected number of commas (allowing for commas inside quoted text) and any that have lines with more or less than the expected commas are "bad". Any with any lines with too many are structurally flawed and irredeemable. Any with any less may well be fixable, using the above process.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1619 views
  • 0 likes
  • 3 in conversation