Hello,
Need your help to tackle a space issue in my environment.
We have a 260 GB dataset filling up most of the space on our mountpoint in unix environment and the new data keeps appending to it on a daily basis.The dataset will be further used by SAS portal users to execute stored proceses.
After identifying that the dataset cannot be moved, optimised in any other manner, we would like to take up the idea of feasibility to compress the dataset.
After compression, the 260 GB dataset came down to 54 GB.As we know, indexes cannot be created on compressed datasets.The original uncompressed dataset was indexed.
So, if we were to plan to implement this, the comparision will be in 2 stages.
- First, to compare the real time taken by querying the indexed and unindexed versions of the uncompressed dataset.
- If the above result is satisfactory, we can extend to compare the performance of the real time taken by querying the compressed version of the dataset (without index).
Please help me to understand if the above approach is ok or suggest better ideas further.
I want to
1) mainly tackle the space issue by trying to decrease the dataset size
2) the portal users should still be able to run their queries on the dataset albeit with some decline in performance.
Thank you in advance.
Cheers
Mark.
It appears you have a good handle on the situation. I beg your pardon if these issues have already been dealt with by your group, but they are worth the waste of time:
1. Are you sure you can't reduce the uncompressed size of the dataset? Possibly coding, aggressively shrinking numeric variables, lookup tables?
2. 250 GB isn't a huge amount of storage in an era where I can buy 3 TB at Staples. If user satisfaction is at stake, is an increase in disk space a possibility?
3. Can you remove older records, and hold them in a slower retrieval environment?
4. There are the options of storage products optimized for statistical use of large datasets, but I suspect the price/performance is beyond your budget.
Assuming that none of these help, the only additional comment I can make is that when you do your performance testing on the uncompressed data with no indexes, you should assume that your tests on the compressed data will be inherently faster due to the smaller volumes being processed from disk. The additional CPU time needed to uncompress will almost certainly not consume as much time as the compression will save.
Good luck, let us know what you find out.
Tom
Thank you Tom for helping on this:
 
1. Are you sure you can't reduce the uncompressed size of the dataset? Possibly coding, aggressively shrinking numeric variables, lookup tables?
I will check on shrinking numeric variables.
Our Teradata database stores last 6 months of transactions.This sas dataset is infact the warehouse containing the data of all customers from a long time.Every time the job runs, if fetches data for TD and appends data from time to time to this main dataset.
This is used by portal users as a lookup for all the data related to a customer.
 
2. 250 GB isn't a huge amount of storage in an era where I can buy 3 TB at Staples. If user satisfaction is at stake, is an increase in disk space a possibility?
Yes, 250 GB is not very huge. It was only filling our allocated space and we dont have additional space from other mount points to add to it.
We are in process of migrating to Grid already. So, investment on increasing space wasnt a option to us.
 
3. Can you remove older records, and hold them in a slower retrieval environment?
As per above detail, Cannot remove older records
 
4. There are the options of storage products optimized for statistical use of large datasets, but I suspect the price/performance is beyond your budget.
-
 
Will try further on this and let you know.
Looks like good advice to me @TomKari, aggresively coding, normalising data, creating relationships shoulddecrease the size of the data. Also, good point about disk space, I have recently seen a 16tb SSD drive released, size of a thick credit card, so there really is some heavy storage out there and quick as well.
I don't know where you got the idea that indexes cannot be used with a compressed data set, but either way, that statement is incorrect.
Then I wonder, do the Stored Process really need all this data? Since StP are quite a controlled environment, you tend to design those queries in a way that limits the data throughput.
You seem to already get a fairly good compression rate. If you data have lots of numerical variables, try BINARY compression. With compression, you reduce the I/O. But you pay with CPU.
If your data have lot of data variables, use NUM(4) for those variables.
If not already done so, store the data in a SPDE, where you the features of parallel I/O, and more efficient index management.
I don't know what your application is supposed to do, but having StP do table scans on such "medium" large data sets would not be satisfactory to the end users.
Thank you Linus. Yes, indexes can be created on compressed dataset as well sorry I was only in my initial analysis yesterday before I posted this query without testing from my end.Infact, I came across the statment "Index cannot be created on compressed dataset." in one of the papers http://www.pharmasug.org/proceedings/2016/QT/PharmaSUG-2016-QT23.pdf
Our Teradata database stores last 6 months of transactions.This sas dataset in discussion is infact the warehouse containing the data of all customers from a number of years earlier.Every time the job runs, if fetches data for TD and appends data from time to time to this main dataset. This is used by portal users as a lookup for all the data related to a customer from all the years.
Yes, I have > 80 % numeric variable but strangely BINARY compression for same number of 400 million records resulted in 55.1 GB while CHAR resulted in 53.7 GB . It happens 🙂
First, I wanted to deal with this space issue because it soon going to be filling the mount point and later take up more efficient ways to manage it once stabilised.
-Mark
I come to understand for large datasets that it actually takes a long time for the index file to be created and also to get updated when the dataset is appended.
Just wanted to know which is better and efficient way to handle the index file.
1) to just continue with append of the dataset and let the index file get updated by itself
2) to delete the index and recreate it after appending
Thanks all for your support.
Again: SPDE... 
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
