Hi, All
I made two different file types (xlsx and txt : csv) by using the option, export from a SAS project.
I believe that the size of a text file is smaller than excel file only except large dataset over a million rows (becuase excel file can store only 2x2^20 - around a million rows).
However, when I export data from SAS with these two types of file by using a same data, I found that some dataset saved as excel(.xlsx) is much smaller than text type.
Is there any automatically applied compression process when the dataset is exported as xlsx from SAS?
I tried that matching up data types (numerical and char fields) of columns and other possible options, but I still do not get a clear explaination on that.
If you have any suspicous part and possible ideas on this, please drop a hint. 🙂
I really appreciate any help from you all.
The modern Office files (since Microsoft finally saw the light and followed the example set by OpenOffice) are groups of xml files contained in a single .zip archive. XML is text with tags, so the files would initially be MUCH larger than ordinary text files, but the zip compression takes care of that very nicely.
How much compression rate you get is dependent on the contents, so there is no clear cut answer.
But note that text files are inherently better for any kind of data transfer. A text file is a text file is a text file, eg the csv format has not changed for decades, and will not change in the foreseeable future. You don't need a special version of some special software to read it, any text editor is sufficient. The same is demonstrably not true for Office files.
xlsx files are actually zip files.
Rename your xlsx file to a zip extension and open it with your decompressor of choice to verify this fact.
Thank you for your tip. I followed that your posting and I figured out the anatomy of excel files.
I believe that the size of a text file is smaller than excel file only except large dataset over a million rows (becuase excel file can store only 2x2^20 - around a million rows).
I think that’s incorrect and definitely more complicated than a simple rule. It’ll also depend on the variable types, number of columns and wether the Excel file has graphs or any ther formatting.
Thank your for your reply.
I understand that the file size is related to many complicate rules as you mentioned.
Since I handled very limited columns and data types (only few text and numbers) in this case, I had a question with a simple idea.
Thank you for your advice!
The modern Office files (since Microsoft finally saw the light and followed the example set by OpenOffice) are groups of xml files contained in a single .zip archive. XML is text with tags, so the files would initially be MUCH larger than ordinary text files, but the zip compression takes care of that very nicely.
How much compression rate you get is dependent on the contents, so there is no clear cut answer.
But note that text files are inherently better for any kind of data transfer. A text file is a text file is a text file, eg the csv format has not changed for decades, and will not change in the foreseeable future. You don't need a special version of some special software to read it, any text editor is sufficient. The same is demonstrably not true for Office files.
From your posting,
I figured out the difference of xlsx file with my dataset(changed to .zip and checked the generated files in xml)
There were several common data in the dataset, and those were written in sharedStrings.xml of the archive.
now I understood that what happend on files.
It answered my questions and I learned a lot.
I appreciate with your tips.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.