09-24-2014 04:41 PM
(new DI Studio user)
I have a DI Studio job that uses External File and File Reader nodes to import data from comma-delimited CSV files exported from Excel. Each observation has about 25 columns, one of which is a 'product description' column with wildly varying text.
What I am seeing is that after the data is imported into DI Studio, when I check the SAS dataset, some (very few) of the values in this description column appear to be preceded by a single quote. For example: "HP - Printer base with media .....
Anyway, the original source file does _not_ have this single quote at the front. I have even run the problem cell text through =CLEAN and =TRIM in Excel. Nothing. I've even run =IF(CLEAN(A1)=A1,"NA","Needs Cleaning") and I'm getting "NA" as a result.
The data looks good.
So, how is this a problem? Well, any row that has this single quote at the front of the description has all columns after it corrupted. If that cell has a comma in it, then the data after the comma is pushed to the next column and all columns after that are blank. If the description does _not_ have a comma in it, then simply every column after it is blank.
Note that rows with commas in the description still keep the full text in the description column, as long as that single quote isn't at the front. So I think the External File and File Reader nodes are parsing the comma-delimited file correctly.
I hope that was clear. Does anyone have any idea what could be causing this single quote to appear, and how I can remove it / stop it corrupting the rest of the row?
09-24-2014 06:05 PM
You need to look at your CSV not with Excel, which might "hide" stuff from you, but with a text editor like Notepad++.
With Notepad++, under menu "View/Show Symbol" select "Show all characters". Are there any special characters around the strings you're having problems with?
Also with Notepad++: what does it say the encoding of the CSV is (under menu "Encoding")?
09-25-2014 10:52 AM
Good call, Patrick.
I opened the file up in Notepad++ and I can see that there are actually pairs of quotes around some of the data. It's not consistent which columns have quotes around them, but it seems to wreak havoc with DI Studio as it attempts to split the data on the comma delimiters.
Is there a way to basically ignore these quotes? And will commas in the description still push data into the wrong columns?
09-25-2014 11:05 AM
I suspect if you look at the quoted data it will be values with a comma in them.
Once upon a time we had comma-separated, quote-qualified files, which meant all of the text values had double quotes around them. But later producers of CSV seem to quote only the values that contain commas, hence the inconsistent layout. And it gets worse if your data contains quotes.
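To see the convention in action outside of SAS (a quick sketch using Python's csv module, which follows the same minimal-quoting rule most CSV writers use): only the field containing a comma gets quoted, and reading it back restores the field whole.

```python
import csv
import io

# Writing a row whose middle field contains a comma:
# the writer quotes only that field.
buf = io.StringIO()
csv.writer(buf).writerow(["12345", "HP - Printer base, with media", "99.99"])
print(buf.getvalue())
# 12345,"HP - Printer base, with media",99.99

# Reading it back: the quoted field comes out as one value,
# embedded comma intact, quotes stripped.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)
# ['12345', 'HP - Printer base, with media', '99.99']
```

The DSD option on the SAS INFILE statement applies the same rule when reading: quotes around a value are stripped, and commas inside quotes are treated as text rather than delimiters.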
09-25-2014 11:12 AM
I'm seeing that to be apparently true. All quoted data (so far) appears to be the values with commas in them.
...so, how is this handled?
09-25-2014 12:25 PM
SAS will read the quoted values properly when reading in the data; the DSD option strips the quotes and keeps embedded commas inside a field.
For example, just try reading all of the data into character variables and see what you get. The code below should read the first 20 columns:
data example;
  infile 'myfile.csv' dsd dlm=',' truncover lrecl=30000;
  length col1-col20 $200;
  input col1-col20;
run;
09-29-2014 12:42 PM
(I apologize for the late response, I was out of the office for a few days...)
Apparently, when the 'Logical record length' of the External File is set smaller than some of the rows, the record gets truncated, which can split a quoted value and leave an unmatched quote, so DI Studio naturally can't match the quote pairs and mis-parses everything after it. Once I upped the record length to the combined length of the columns, all is well.
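For anyone else who hits this, the failure mode is easy to reproduce in miniature (sketched here with Python's csv module, since any CSV parser behaves the same way): truncating a record inside a quoted field drops the closing quote, so the parser swallows the rest of the record into one value.

```python
import csv
import io

line = '123,"HP - Printer base, with media",456,789'

# At full record length, the quoted field parses cleanly.
print(next(csv.reader(io.StringIO(line))))
# ['123', 'HP - Printer base, with media', '456', '789']

# Truncated mid-field (what a too-small LRECL does), the closing
# quote is lost and the trailing columns disappear.
truncated = line[:20]   # '123,"HP - Printer ba'
print(next(csv.reader(io.StringIO(truncated))))
# ['123', 'HP - Printer ba']
```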
Not sure why this took me so long to figure out. I guess the time away was good to clear out the noggin.
09-25-2014 12:23 PM
SAS should be able to read a CSV file generated by Excel.
Did you try just reading the file with a simple data step? Does it read the fields properly?
Perhaps it is just an issue for DI?
Or is it possible the file is read OK, but the extra single quotes in some values are causing trouble later on when the values are used?