09-22-2011 06:56 AM
SAS 9.2 on Windows 2003 64-bit.
I'm "playing" / experimenting with the SPDE engine. For now, all my data segments are written to the same directory.
I'm converting a 1.1GB base dataset to SPDE format using this code:
libname claimdms spde "X:\scott_temp\claimdm";
proc copy in=claimdm out=claimdms;
run;
I've invoked SAS with the -minpartsize 512M option, and the log shows:
6 %put %sysfunc(getoption(minpartsize));
However, here are the SPDE data segments:
Directory of X:\scott_temp\claimdm
22-Sep-11 20:32 <DIR> .
22-Sep-11 20:32 <DIR> ..
22-Sep-11 20:30 105,908,987 addressdim.dpf.1a801ac4.0.2.spds9
22-Sep-11 20:30 103,789,332 addressdim.dpf.1a801ac4.1.2.spds9
22-Sep-11 20:30 104,341,089 addressdim.dpf.1a801ac4.2.2.spds9
22-Sep-11 20:30 100,655,953 addressdim.dpf.1a801ac4.3.2.spds9
22-Sep-11 20:30 100,480,884 addressdim.dpf.1a801ac4.4.2.spds9
22-Sep-11 20:31 99,355,626 addressdim.dpf.1a801ac4.5.2.spds9
22-Sep-11 20:31 100,634,257 addressdim.dpf.1a801ac4.6.2.spds9
22-Sep-11 20:31 102,279,637 addressdim.dpf.1a801ac4.7.2.spds9
22-Sep-11 20:31 105,780,925 addressdim.dpf.1a801ac4.8.2.spds9
22-Sep-11 20:31 90,204,915 addressdim.dpf.1a801ac4.9.2.spds9
22-Sep-11 20:31 234,209,280 addressdim.hbxaddresskey.1a801ac4.0.2.spds9
22-Sep-11 20:32 95,232,000 addressdim.hbxcovno.1a801ac4.0.2.spds9
22-Sep-11 20:31 8,192 addressdim.idxaddresskey.1a801ac4.0.2.spds9
22-Sep-11 20:32 59,662,336 addressdim.idxcovno.1a801ac4.0.2.spds9
22-Sep-11 20:32 1,793,408 addressdim.mdf.0.0.0.spds9
15 File(s) 1,404,336,821 bytes
So, none of the segments are close to 512M. And I would have expected 2-3 DPF segments.
I've also added PARTSIZE=1024M to the libname statement with similar results.
So how exactly is the MINPARTSIZE meant to work?
09-22-2011 12:26 PM
As you already know, MINPARTSIZE must be specified when SAS starts (it is a system invocation option).
You used compression, which affected the stored size of the partitions.
Without compression, your MINPARTSIZE would result in files reported in Windows Explorer as 524,288 KB.
I don't have your data contents, so my compression rate was different. When I used character compression, the partitions shrank from 524 MB to sizes varying around 340 MB.
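One way to confirm that compression alone explains the smaller files is to rewrite the table uncompressed and compare sizes. A sketch, untested, reusing the claimdm libref from the original post (the output path and libref claimdmu are assumptions):

```sas
/* Sketch: copy the table to an uncompressed SPDE library so the
   .dpf partition files should land on the 512 MB boundary
   (524,288 KB in Windows Explorer). */
libname claimdmu spde "X:\scott_temp\claimdm_uncomp"
        partsize=512m compress=no;

proc copy in=claimdm out=claimdmu;
run;
```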
09-22-2011 05:53 PM
D'oh! I should have thought of that.
Hmmm...it raises the question, though: should SAS take compression into account as it creates the data partitions?
What I'm trying to do is optimize the data segment size based on the characteristics of our disks, the size of the dataset, the cardinality of the data, and the typical queries.
One of my datasets is 36GB and it created 555 files in the file system directory. Many of the typical queries will return a fair amount of the dataset, say 25% plus.
I admit that much of this does my head in. I think it will mainly be trial and error until I get the best I/O performance for my dataset.
09-22-2011 06:03 PM
SAS has some general guidelines it recommends for PARTSIZE in a library. You can find these recommendations in this document.
The general recommendation is to balance the number of partitions by the number of i/o channels and cpus.
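As a hedged illustration of that guideline (the paths, channel count, and sizes below are assumptions for illustration, not from this thread): for a 36 GB table spread across four data paths, targeting roughly two partitions per I/O channel gives a PARTSIZE around 4.5 GB:

```sas
/* Sketch: 36 GB table, 4 I/O channels, ~8 partitions -> ~4.5 GB each.
   DATAPATH= spreads partition files across disks; INDEXPATH= keeps
   index components separate. All paths are hypothetical. */
libname bigspde spde "D:\spde\meta"
        datapath=("E:\spde\data1" "F:\spde\data2"
                  "G:\spde\data3" "H:\spde\data4")
        indexpath=("I:\spde\idx")
        partsize=4608m;
```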
09-23-2011 04:16 AM
For performance, I would suggest looking beyond partitioning through SPDE.
Perhaps your big data has some natural partitions that affect most queries. One example in a financial environment would be transaction date, where partitioning by business year is natural. SPDS uses a technique called clusters: rather than maintaining a 35 GB blob as one entity, the data is updated in smaller units. Analysis refers to a cluster of smaller tables, and the server delivers them seamlessly.
Sorting time is not linear by size - and that is the largest time part of re-indexing.
SQL queries will "pass through" opportunities for optimisation to the indexes on the tables underlying a view.
When doing trials of these alternative approaches, remember to use the diagnostic option MSGLEVEL=I.
It provides interesting information about the selection and rejection of indexes.
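A minimal way to see those index messages, sketched against the addressdim table from the directory listing above (the filter value is hypothetical; the covno index exists per the idxcovno component file):

```sas
/* With MSGLEVEL=I, the log reports whether the index on covno
   was used, or why it was rejected, for this WHERE clause. */
options msglevel=i;

proc sql;
  select count(*)
  from claimdms.addressdim
  where covno = '12345';
quit;
```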
09-23-2011 05:22 AM
I'm not sure I followed all of that...
A colleague suggested breaking our data into separate datasets, partitioned by year, as you suggest above. So, "bigdata.sas7bdat" becomes bigdata_2008, bigdata_2009, bigdata_2010, etc. Then, we use views to logically concatenate the physical files. However, the end user would need to select the correct view when generating their queries (the main impetus for this performance tuning is end-user interactive queries, such as via EG).
Oracle has the concept of partitioned tables, where you can partition a table as described above, and the query engine intelligently selects the correct partitions based on the where clause.
It would be cool if I could do this via the SPDE engine, i.e. 2008 records go into component file 1, 2009 records go into component file 2, etc., and the engine selects the correct partition(s) based on the where clause.
In my testing today, with all component files in a single directory (i.e. no tuning across controllers/disks), I still got significant performance improvements (usually 50% faster or more) using the SPDE engine. I did read that SPDE works better with queries returning a large portion of the data.
Interesting SGF papers:
123-29, Raithel: Good explanation of indexes and when they are (and are not) effective.
bb10, Keintz (NESUG): Creating and using a custom generated index.
A number of papers from SAS employees (Clifford, Brown, Crevar, ...) discussing SPDE and I/O optimization. Unfortunately, I'm reliant on our IT department for I/O optimization, which may or may not happen depending on whether they feel like working that day (just kidding...sort of).
Whatever approach we come up with, I'd like one that has benefits across all applications, such as SPDE, I/O optimizations, BUFSIZE, BUFNO, SGIO, etc. Partitioning by year is an option, but it would be best if it could be done transparently, à la Oracle partitioned tables.
Thanks for your reply...
P.S.: We only have SPDE, not SPDS.
09-23-2011 07:56 AM
Maintaining your data with automatic partitioning by year isn't too hard to construct, assuming the bulk of updates hit the current and previous years, while the remaining years see only light updates.
If you use a unique index on your SPDE tables, then:
proc append base=spdetableX data=changes_for_yearX uniquesave=rep;
run;
New keys are appended and existing records are replaced = a great update routine!
It is functionality inherited from the SPDS software that is within SPDE.
As it uses the index to maintain the data I don't think it would be appropriate for bulk updates with lots of appends.
Splitting update transactions by "year" provides opportunity for parallel processing of the updates.
In use by analysts, "bigdata.sas7bdat" would effectively become a view, but they would refer to libname.bigdata without being aware that it is a view, not data. To achieve the "best" performance, you create it as an SQL view:
proc sql;
  create view lib.bigdata as
    select * from sp_yr1.bd_yr
    outer union corresponding
    select * from sp_yr2.bd_yr
    outer union corresponding
    select * from sp_yr3.bd_yr;
quit;
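Used this way, analysts query lib.bigdata as if it were one table. A sketch, assuming a year variable exists in the underlying tables:

```sas
/* With MSGLEVEL=I set, the log should show the year-based
   optimisation being applied to the underlying SPDE tables. */
options msglevel=i;

data work.subset;
  set lib.bigdata;       /* one logical table to the analyst */
  where year > 2009;     /* filter pushed through the SQL view */
run;
```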
09-23-2011 05:15 PM
Thanks for your reply(ies), much appreciated.
Let me restate what you said above to make sure I understand what you're saying...
1) Use the SPDE engine over the BASE engine for my scenario.
2) Partition my data by year. For example (pseudocode and untested):
libname spde spde "some path";

data spde.mydata_2008 spde.mydata_2009 spde.mydata_2010 spde.mydata_2011;
  set work.mydata;              /* hypothetical source dataset */
  select (transaction_year);
    when (2008) output spde.mydata_2008;
    when (2009) output spde.mydata_2009;
    when (2010) output spde.mydata_2010;
    when (2011) output spde.mydata_2011;
    otherwise;
  end;
run;
3) Index the separate SPDE datasets "appropriately" based on the typical queries the end users make.
4) If one of those indexes is unique, i.e. the concatenation of the keys uniquely identifies a record, then I can use this code to update/insert *:
proc append base=spde.mydata_2011 data=work.transaction_data_2011 uniquesave=rep;
run;
Don't use this technique if transaction_data_2011 is "large", i.e. it's a "bulk" update.
5) By splitting the data by year, we can parallel process the updates, i.e. schedule multiple SAS transaction load jobs to run concurrently.
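One hedged way to run the yearly loads concurrently from a single controlling session is SYSTASK (the SAS executable path, program names, and task names below are assumptions):

```sas
/* Sketch: launch one batch SAS job per year, then wait for all
   to finish before continuing. */
systask command """C:\Program Files\SAS\sas.exe"" -sysin load_2010.sas -log load_2010.log"
        taskname=load2010;
systask command """C:\Program Files\SAS\sas.exe"" -sysin load_2011.sas -log load_2011.log"
        taskname=load2011;

waitfor _all_ load2010 load2011;
```

Alternatively, a scheduler can simply kick off the per-year jobs at the same time; the point is that year-partitioned tables have no shared write target, so the loads don't contend.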
6) What my colleague (ok, my boss) suggested was to create separate data step views for ALL the permutations of the years. For example (pseudocode):
v_mydata_all (all years)
etc (I may have missed some combination)
Then, the end user would have to select the correct view when doing their analysis (in EG, Futrix, and the Stored Process Web Application) to get the desired performance. Which IMO is not ideal.
Instead, are you suggesting that a single proc sql view defined as above (outer union corresponding) will get acceptable performance due to the underlying SPDE dataset and related indexes? (That would be great).
Did you mean to use different librefs in your proc sql view, rather than different datasets?
Would it be "smart enough" to stop reading data when a where clause is filtering by year (e.g. transaction_year in (2009, 2010))? What about transaction_year in (2009, 2011), i.e. non-consecutive years (probably wouldn't happen, but...)?
Can you comment on why an SQL view will be more performant than a data step view?
Keep in mind that the main reason for this performance tuning/data restructure is to give better performance for interactive queries; we can live with current batch job performance, although that could change in the future as data volumes increase. I mention this to emphasize that the easier the architecture is for the end users, the better.
* As an aside, the proc append functionality reminds me of an Oracle "upsert" (update / insert). It's an update technique which first tries an update, then an insert if the key was not found. Google "Oracle upsert" for more details.
Please let me know if I've misunderstood what you've stated above.
09-24-2011 05:10 AM
1 yes (small is better than large and you can tune partsize to fit a whole partition into cache)
2 I would suggest separate libname definitions for each year.
3 yes (and find some way - like collecting and parsing altlogs - to identify the more and less-popular indexes to speed things going forward. Index-usage is reported by option MSGLEVEL=i )
4 unfortunately uniquesave=REP no longer applies in SPDE (my memory is fading, but I thought it tested OK in SAS 9.0; tests of this fail in 9.2 and 9.3).
The UNIQUESAVE= option is a dataset option, and the REP value is now valid only in SPDS, not SPDE. Check whether the other values for UNIQUESAVE= might help. The SPDE manual is good, in PDF or HTML, at http://support.sas.com/documentation/cdl/en/engspde/61887/HTML/default/viewer.htm#titlepage.htm
5 correct - parallel execution is a major reason to partition data
6 since it comes at little extra cost (a view takes little storage), you can test whether you need your colleague's suggestion. If the year index is chosen for WHERE optimisation, then there should be no difference; but as (I think) only one index will be used for WHERE optimisation, the alternatives your colleague suggested will enable other indexes to deliver significant subsets: for example, all of a particular product over a subset of years (assuming product is indexed).
With the principle that only one index is going to be used to extract the data, you can see some of the limitations. However, an SQL view will pass through index optimisation (a data step view is filled before external WHERE clauses are applied; SQL can optimise what it does with an SQL view, but it doesn't know what a data step does), so it can pass through the "where year > 2009" kind of test: for the 2008 and 2009 tables, no partitions would be opened, only the indexes for those year-based tables.
hope this helps
09-23-2011 11:01 AM
Clustering is not a feature available in the SPD Engine, and a lot of the indexing features in SPDS are also not available in SPDE; the specifics escape me at the moment.