Good morning everyone,
I am a total novice with CSV files.
Today I started downloading data from another data source (an xls file) in order to match two datasets together (the other dataset is currently in sas7bdat format).
Although I am able to import it with PROC IMPORT, I tried transferring all the data to one Excel sheet and saving it as CSV, to practice what @ballardw and other experts have mentioned to me many times, for the reason that "Excel columns do not have a data type. Any cell can hold anything. So data 'imported' from Excel is subject to all sorts of potential error" (a good chance for me to challenge myself practically).
Because I wanted a smooth start, I merged all the data into one Excel sheet before converting from xlsx to csv.
Then I used Excel's "Save As" function to save the xlsx file as CSV.
And two things confuse me here:
1. Why is the csv file much bigger than the original xlsx file, when I thought a csv file mainly contains text?
You can see the picture below.
2. Why does Excel pop up a dialog like that when I open the csv file?
Is there a problem here, or should I just ignore it?
3. @andreas_lds mentioned:
"If there is only one sheet in each file, just click "OK" and everything is fine.
Important: Don't open csv-files with excel, use a text editor (like notepad++) instead."
Could you please tell me the logical reason behind this?
I ask because when I open the file in Notepad, it looks garbled and I can't read the data clearly.
The original data (xlsx):
When I open it in Notepad, it looks messy to me:
Warm regards.
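(On question 1: an .xlsx file is a zip archive, so its XML content is deflate-compressed, while a CSV is uncompressed plain text. Repetitive tabular text compresses very well, which is why the CSV on disk can be much larger. A rough, self-contained illustration of the effect, using zlib rather than a real .xlsx:)

```python
import zlib

# A CSV is plain, uncompressed text; an .xlsx stores its XML
# deflate-compressed inside a zip container. Repetitive tabular
# text compresses very well, so the raw CSV can be much larger.
csv_text = "id,name,amount\n" + "".join(
    f"{i},item_{i},{i * 10}\n" for i in range(1000)
)
raw_size = len(csv_text.encode("utf-8"))
compressed_size = len(zlib.compress(csv_text.encode("utf-8")))

print(raw_size, compressed_size)  # the compressed form is several times smaller
```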
Accepted Solutions
Just be aware that what you see in Excel may not be the actual value. If you see anything "odd", it may well be one of these conversions. DO NOT edit and then save back to the same CSV; you may well lose the original information. If you think you must, save to a new file and then compare it to the original. Yes, it is a pain dealing with CSV in text format. Be glad it isn't fixed-column input with almost no white space; commas are easy. Tab-delimited is fun because you often can't tell where columns end, depending on the lengths of values and the tab settings of the viewing software.
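One way to do that "save to a new file and then compare" step, sketched in Python (file names and the trailing-zero example are hypothetical):

```python
import difflib

def csv_diff(original_lines, edited_lines):
    """Return a unified diff between two CSV files' lines, so you can
    see exactly what a round trip through Excel changed."""
    return list(difflib.unified_diff(
        original_lines, edited_lines,
        fromfile="original.csv", tofile="edited.csv",
        lineterm="",
    ))

old = ["id,amount", "1,9.50", "2,3.25"]
new = ["id,amount", "1,9.5", "2,3.25"]  # e.g. Excel dropped a trailing zero
for line in csv_diff(old, new):
    print(line)
```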
My comment about @#$%#%$# is that the data source will randomly change the number of header rows in the spreadsheets: sometimes they add one or two, sometimes remove one or two, sometimes keep the same as last time. I have to manually check, for every one of the spreadsheet pages I use from that source, which row the data starts on, so I can set FIRSTOBS correctly in my INFILE statements.
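That manual check can be partly automated. A minimal sketch, assuming the data rows have a known column count and a numeric first field (both assumptions you would adjust per file; `first_data_row` is a hypothetical helper, not a standard function):

```python
import csv
import io

def first_data_row(text, expected_cols, delimiter=","):
    """Return the 1-based row number of the first row that has the
    expected number of columns and a numeric first field -- a crude
    heuristic for locating where data starts under shifting headers."""
    for i, row in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter), start=1):
        if len(row) == expected_cols and row[0].replace(".", "", 1).isdigit():
            return i
    return None

sample = "Quarterly report\n\nid,name,amount\n1,widget,9.50\n2,gadget,3.25\n"
print(first_data_row(sample, 3))  # -> 4, the value to use for FIRSTOBS
```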
Not only that, I have to check the contents. I have had file-identification fields appear as currency or dates, currency fields appear as dates, and dates appear unformatted. A CSV saved from such a file will have the cell contents appear that way in the text version. So I often have to use Excel's "Format Cells" on one or more columns to set the properties so the export reads correctly. For added fun, that source sometimes changes the format 400 rows into the file: the top values look good, and then my CSV-reading program throws errors because a currency field should not contain / (date delimiters), or the dates contain $. One of the things that really chaps my hide is that the spreadsheets are generated by a "program", but the source won't tell me how the columns acquire their properties, because a column that is currency one month is a date the next, and "general" (as in the Excel General cell format) the month after that.
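A quick pre-read scan can catch those mid-file format switches before the real import does. A sketch under the assumption that you know which characters should never appear in a given column (the helper and sample data are illustrative only):

```python
import csv
import io

def suspect_rows(text, col_index, bad_chars):
    """Flag data rows where a column contains characters that should not
    appear in it (e.g. '/' in a currency field, '$' in a date field)."""
    flagged = []
    rows = list(csv.reader(io.StringIO(text)))
    for lineno, row in enumerate(rows[1:], start=2):  # skip the header row
        if col_index < len(row) and any(c in row[col_index] for c in bad_chars):
            flagged.append(lineno)
    return flagged

sample = "id,amount\n1,$9.50\n2,05/18/2024\n3,$3.25\n"
# A '/' in the amount column means a date leaked into a currency field:
print(suspect_rows(sample, 1, bad_chars="/"))  # -> [3]
```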
Generally, what that note about features being "lost" refers to is stuff like cell highlighting, background colors, fonts and such.
CSV files, being simple text files, do not have any of those enhancements.
What you do have to watch for, because they can screw with the CSV delimiters, are merged cells. One merged cell in the middle of the data, which may not be noticeable, can cause a LOT of invalid-data messages.
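A merged cell exported to CSV often shows up as a line whose field count differs from its neighbors, so a simple per-line count can flag it. A naive sketch (it splits on the raw delimiter and so ignores quoted fields that themselves contain commas):

```python
def field_counts(lines, delimiter=","):
    """Count fields per line; a line with a different count than its
    neighbors is a candidate for a merged-cell (or stray-delimiter)
    problem. Naive split -- ignores quoting."""
    return [line.count(delimiter) + 1 for line in lines]

lines = ["a,b,c", "1,2,3", "1,2", "4,5,6"]  # the 2-field line is suspect
counts = field_counts(lines)
print(counts)  # -> [3, 3, 2, 3]
```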
I routinely Save As CSV as many as 5 or 6 sheets out of XLSX files.
Then I count the header rows, because the !@#$@#$! sources keep changing them, so I have to modify my programs' FIRSTOBS.
Then I double-check the column order, because they can't keep columns in the same order either. Hint: if you have an informat/length block of code defining variable properties in a data step, you only need to change the order on the INPUT statement (as long as the actual column contents don't change). Hint: specify variables in the informat/length code in the column order you would like to see them; the input order doesn't matter (much).
If you open a text CSV file (or tab-delimited or another text format), Excel has been known to look at the contents and make decisions about what the value should be. So suppose you have a column, like part numbers, whose values look like 73-123. Excel may see a value like 5-18, "decide" you meant a date like May 18, and store it internally as a date, for that cell only, resulting in a value in the 47568 range depending on which year is being assumed. Then if you import the data you may get 47568. Or if you save the file back to CSV, the results can be unpredictable.
Hi @ballardw
Thank you for all the detailed information and suggestions; I will go step by step.
Regarding this part
What you do have to watch for, because they can screw with the CSV delimiters, are merged cells. One merged cell in the middle of the data, which may not be noticeable, can cause a LOT of invalid-data messages.
So normally we can only check for this by opening the file in Excel rather than Notepad, right? The file looks messy to me when I open it in Notepad, and there seems to be no way for me to check the aforementioned issue there.
And
Then I count the header rows, because the !@#$@#$! sources keep changing them, so I have to modify my programs' FIRSTOBS.
Can I ask what "count the header rows because the !@#$@#$! sources keep changing them" means in practice?
Cheers.