BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Phil_NZ
Barite | Level 11

Good morning everyone,

 

I am totally novice to csv.

 

Today I start to download the data from another datasource (xls file) to match two datasets together (another dataset now is under sas7bdat datatype).

Even I am able to use PROC import to import I try to transfer all data to one excel sheet and try to save under CSV type to try what @ballardwand other experts mentioned many times to me with the reason that " Excel columns do not have a data type. Any cell can hold anything. So data "imported" from Excel is subject to all sorts of potential error"( good chance for me to challenge myself practically).

 

Because I want to have a smooth start so I try to merge all the data to one excel sheet before converting from xlsx type to csv type. 

Then I use the "Save As" function of Excel to save xlsx to the CSV type.

And there are two things that confuse me here:

 

1. Why the csv file is much bigger than the original xlsx file, while I think csv file mainly contains text:

You can see the picture as below

 

My97_0-1614805421151.png

 

2. Why I open the csv file, it pops up like that

 

My97_1-1614805496841.png

Is there any problem here or I just ignore this one?

 

3. @andreas_lds mentioned:

"If there is only one sheet in each file, just click "OK" and everything is fine.

Important: Don't open csv-files with excel, use a text editor (like notepad++) instead."

Could you please tell me the logical reason behind it?

I ask this question because it messed up when I open the file by Notepad, I can't read the data clearly.

The original data (xlsx type)

 

My97_0-1614806219998.png

When I use Notepad, it messes up to me

 

My97_1-1614806251227.png

 

 

 

Warm regards.

 

 

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Just be aware that what you see in Excel may not be the actual value. If you see anything "odd" it may well be one of these conversions. DO NOT 'edit' and the save back to the same CSV. You may well lose the original information. If you think you must, save to a new file and then compare. That to the original. Yes it is a pain dealing with CSV in text format. Be glad it isn't fixed column input with almost no white space. Commas are easy. Tab delimited is fun because you often can't tell where columns end depending on the lengths of values and the tab setting for the viewing software.

 

 

My comment about @#$%#%$#  is the data source will randomly change the number of header rows in the spreadsheets. Sometimes add one or two, sometimes remove one or two, sometimes the same as last time. I have to manually check which row every one of the spreadsheet pages I use from that source starts data I can set the Firstobs correctly in my Infile statements.

 

Not only that I have to check the contents. I have had file identification fields appear as currency or dates, currency fields appear as dates, and dates appear unformatted. A CSV saved from such will have the cell contents appear as such in the text version. So I often have to use the Excel "format cells" on one or more columns to set the properties so the export will read correctly. For added fun I have that source change the format like 400 rows into the file. The top values look good and then my CSV read program throws errors because a currency field should not contain / (date delimiters) or the dates contain $. One of the things that really chaps my hide is the spreadsheets are generated by a "program" but the source won't tell me how the columns acquire the properties because the column that is currency one month is date the next and "general", as in Excel general cell format, the next.

View solution in original post

3 REPLIES 3
ballardw
Super User

Generally that not about "lost" is stuff like cell highlighting, background colors, fonts and such.

CSV files do not have any of those sorts of enhancements as simple text files.

 

What you do have to watch for as it can screw with the CSV delimiters are merged cells. One merged cell in the middle of data, that may not be noticeable, can cause a LOT of invalid data messages.

 

I routinely SAVE AS CSV as many as 5 or 6 sheets out of XLSX files.

Then count the header rows because the !@#$@#$! sources keep changing them so I have modify my programs for Firstobs.

Then double check column orders, because they can't keep columns in the same order. Hint: if you have an Informat / length block of code defining variable properties in a data step you only need to change the INPUT statement order (as long as the actual column contents don't change). Hint: Specify variables in the Informat / length code in the column order you would like to see things, input order doesn't matter (much).

 

If you open a text CSV file (or tab delimited or other text formats) Excel has been known to look at the contents and make decisions about what the value should be. So suppose you have a column like part numbers that may look like 73-123. Excel may see a value like 5-18 and "decide" you meant this to be a date like May 18, will do so, store it internally as a date, for that cell only, resulting in a value in the 47568 range depending on which year is being considered. Then if you import the data you may get 47568. Of if you save the file back to CSV the results can be unpredictable.

 

 

Phil_NZ
Barite | Level 11

Hi @ballardw 

Thank you for a lot of dedicated information and suggestion, I may go step by step

Regarding this part

What you do have to watch for as it can screw with the CSV delimiters are merged cells. One merged cell in the middle of data, that may not be noticeable, can cause a LOT of invalid data messages.

 

So, normally, we can only check by opening by Excel rather than Notepad, right, because I feel messy when opening the file by Notepad, and it seems to be no way for me to check such a aforementioned thing.

And

Then count the header rows because the !@#$@#$! sources keep changing them so I have modify my programs for Firstobs.

Can I ask what is the "count the header rows because the !@#$@#$! sources keep changing them" then?

Cheers.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
ballardw
Super User

Just be aware that what you see in Excel may not be the actual value. If you see anything "odd" it may well be one of these conversions. DO NOT 'edit' and the save back to the same CSV. You may well lose the original information. If you think you must, save to a new file and then compare. That to the original. Yes it is a pain dealing with CSV in text format. Be glad it isn't fixed column input with almost no white space. Commas are easy. Tab delimited is fun because you often can't tell where columns end depending on the lengths of values and the tab setting for the viewing software.

 

 

My comment about @#$%#%$#  is the data source will randomly change the number of header rows in the spreadsheets. Sometimes add one or two, sometimes remove one or two, sometimes the same as last time. I have to manually check which row every one of the spreadsheet pages I use from that source starts data I can set the Firstobs correctly in my Infile statements.

 

Not only that I have to check the contents. I have had file identification fields appear as currency or dates, currency fields appear as dates, and dates appear unformatted. A CSV saved from such will have the cell contents appear as such in the text version. So I often have to use the Excel "format cells" on one or more columns to set the properties so the export will read correctly. For added fun I have that source change the format like 400 rows into the file. The top values look good and then my CSV read program throws errors because a currency field should not contain / (date delimiters) or the dates contain $. One of the things that really chaps my hide is the spreadsheets are generated by a "program" but the source won't tell me how the columns acquire the properties because the column that is currency one month is date the next and "general", as in Excel general cell format, the next.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 556 views
  • 2 likes
  • 2 in conversation