ScottBass
Rhodochrosite | Level 12

Hi,

SAS 9.2 on Windows 2003 64-bit.

I'm "playing" / experimenting with the SPDE engine.  For now, all my data segments are written to the same directory.

I'm converting a 1.1GB base dataset to SPDE format using this code:

%put %sysfunc(getoption(minpartsize));

options compress=binary;

libname claimdms spde "X:\scott_temp\claimdm";

proc copy in=claimdm out=claimdms;
  select addressdim;
run;

I've invoked SAS with the -minpartsize 512M option, and the log shows:

6 %put %sysfunc(getoption(minpartsize));

536870912

However, here are the SPDE data segments:

X:\scott_temp\claimdm>dir
Directory of X:\scott_temp\claimdm

22-Sep-11  20:32    <DIR>          .
22-Sep-11  20:32    <DIR>          ..
22-Sep-11  20:30       105,908,987 addressdim.dpf.1a801ac4.0.2.spds9
22-Sep-11  20:30       103,789,332 addressdim.dpf.1a801ac4.1.2.spds9
22-Sep-11  20:30       104,341,089 addressdim.dpf.1a801ac4.2.2.spds9
22-Sep-11  20:30       100,655,953 addressdim.dpf.1a801ac4.3.2.spds9
22-Sep-11  20:30       100,480,884 addressdim.dpf.1a801ac4.4.2.spds9
22-Sep-11  20:31        99,355,626 addressdim.dpf.1a801ac4.5.2.spds9
22-Sep-11  20:31       100,634,257 addressdim.dpf.1a801ac4.6.2.spds9
22-Sep-11  20:31       102,279,637 addressdim.dpf.1a801ac4.7.2.spds9
22-Sep-11  20:31       105,780,925 addressdim.dpf.1a801ac4.8.2.spds9
22-Sep-11  20:31        90,204,915 addressdim.dpf.1a801ac4.9.2.spds9
22-Sep-11  20:31       234,209,280 addressdim.hbxaddresskey.1a801ac4.0.2.spds9
22-Sep-11  20:32        95,232,000 addressdim.hbxcovno.1a801ac4.0.2.spds9
22-Sep-11  20:31             8,192 addressdim.idxaddresskey.1a801ac4.0.2.spds9
22-Sep-11  20:32        59,662,336 addressdim.idxcovno.1a801ac4.0.2.spds9
22-Sep-11  20:32         1,793,408 addressdim.mdf.0.0.0.spds9
              15 File(s)  1,404,336,821 bytes

So, none of the segments are close to 512M.  And I would have expected 2-3 DPF segments.

I've also added PARTSIZE=1024M to the libname statement with similar results.

So how exactly is the MINPARTSIZE meant to work?

Thanks,

Scott


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
1 ACCEPTED SOLUTION

Accepted Solutions
Peter_C
Rhodochrosite | Level 12

Perhaps COMPRESS=BINARY applies after the effect of MINPARTSIZE, i.e. the partitions are sized before compression shrinks them.


11 REPLIES
Peter_C
Rhodochrosite | Level 12

Perhaps COMPRESS=BINARY applies after the effect of MINPARTSIZE, i.e. the partitions are sized before compression shrinks them.

Peter_C
Rhodochrosite | Level 12

As you already know, MINPARTSIZE must be specified when SAS starts.

You used compression, which affected the stored size of the partitions.

Without compression, your MINPARTSIZE would result in files reported in Windows Explorer as 524,288 KB.

I don't have your data contents, so my compression rate was different. When I used character compression, the partitions reduced from 524 MB to sizes varying around 340 MB.
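
One way to see the effect directly is to repeat the copy without compression; the stored partitions should then land at the 512M floor. A rough sketch only (the target path is made up):

options compress=no;

libname nocomp spde "X:\scott_temp\claimdm_nocomp"; /* hypothetical path */

proc copy in=claimdm out=nocomp;
  select addressdim;
run;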


FriedEgg
SAS Employee

Peter is correct: the partition is established before the compression is applied.

ScottBass
Rhodochrosite | Level 12

D'oh!  I should have thought of that.

Hmmm... it raises the question, though: should SAS take the compression into account as it creates the data segments?

What I'm trying to do is optimize the data segment size based on the characteristics of our disks, the size of the dataset, the cardinality of the data, and the typical queries.

One of my datasets is 36GB and it created 555 files in the file system directory.  Many of the typical queries will return a fair amount of the dataset, say 25% plus.

I admit that much of this does my head in. I think it will mainly be trial and error until I get the best I/O performance for my dataset.


FriedEgg
SAS Employee

SAS has some general guidelines it recommends for PARTSIZE in a library. You can find those recommendations in this document.

http://support.sas.com/documentation/cdl/en/engspde/61887/PDF/default/engspde.pdf

The general recommendation is to balance the number of partitions against the number of I/O channels and CPUs.
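
As a rough illustration only (the table size and CPU/channel counts below are made-up numbers, not a recommendation), you could derive a PARTSIZE that spreads the table evenly over the parallel units:

%let tablegb = 16;  /* hypothetical table size in GB */
%let units   = 8;   /* hypothetical: 4 CPUs x 2 I/O channels */
%let partmb  = %sysevalf(&tablegb * 1024 / &units, ceil);
%put NOTE: suggested PARTSIZE is &partmb.M;

libname big spde "X:\scott_temp\bigdata" partsize=&partmb.M; /* path is hypothetical */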

Peter_C
Rhodochrosite | Level 12

For performance, I would suggest more than just partitioning through SPDE.

Perhaps your big data has some natural partitions that affect most queries. One example for a financial environment would be transaction date, where partitioning by business year is natural. SPDS uses a technique called Clusters: rather than maintaining a 35 GB blob as one entity, the data is updated in smaller units; analysis refers to a cluster of smaller tables, and the server delivers them seamlessly.

Sorting time is not linear in size, and sorting is the largest part of re-indexing time.

SQL queries will 'pass through' where clauses, creating opportunities for optimisation with indexes on the tables underlying a view.

When doing trials of these alternative approaches, remember to use the diagnostic option MSGLEVEL=I; it provides interesting info about the selection/rejection of indexes.
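
For example, using your claimdms library above (the where-clause value is made up; COVNO is indexed in your listing):

options msglevel=i;

proc sql;
  select count(*)
  from claimdms.addressdim
  where covno = 12345; /* hypothetical value; the log should then show INFO: lines about index selection */
quit;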

Good luck

peterC

ScottBass
Rhodochrosite | Level 12

Hi Peter,

I'm not sure I followed all of that...

A colleague suggested breaking our data into separate datasets, partitioned by year, as you suggest above. So, "bigdata.sas7bdat" becomes bigdata_2008, bigdata_2009, bigdata_2010, etc. Then, we use views to logically concatenate the physical files. However, the end user would need to select the correct view when generating their queries (the main impetus for this performance tuning is end-user interactive queries, such as via EG).

Oracle has the concept of partitioned tables, where you can partition a table as described above, and the query engine intelligently selects the correct partitions based on the where clause.

It would be cool if I could do this via the SPDE engine, i.e. 2008 records go into component file 1, 2009 records go into component file 2, etc., and the engine selects the correct partition(s) based on the where clause.

In my testing today, with all component files in a single directory (i.e. no tuning across controllers/disks), I still got significant performance improvements (usually 50% faster or more) using the SPDE engine.  I did read that SPDE works better with queries returning a large portion of the data.

Interesting SGF papers:

123-29, Raithel:  Good explanation of indexes and when they are (and are not) effective.

bb10, Keintz (NESUG):  Creating and using a custom generated index.

A number of papers from SAS employees (Clifford, Brown, Crevar, ...) discussing SPDE and I/O optimization.  Unfortunately, I'm reliant on our IT department for I/O optimization, which may or may not happen depending on whether they feel like working that day (just kidding...sort of).

Whatever approach we come up with, I'd like one that has benefits across all applications, such as SPDE, I/O optimizations, BUFSIZE, BUFNO, SGIO, etc. Partitioning by year is an option, but it would be best if it could be done transparently, à la Oracle partitioned tables.

Thanks for your reply...

Scott

P.S.: We only have SPDE, not SPDS.


Peter_C
Rhodochrosite | Level 12

Maintaining your data with automatic partitioning by year isn't too hard to construct, assuming the bulk of updates hit the current and previous years, while the remaining years have only light updates.

If you use a unique index on your SPDE tables, then

proc append base=spdetableX (uniquesave=rep) data=changes_for_yearX;
run;

New keys are appended and existing records are replaced = a great update routine!

It is functionality inherited from the SPDS software that is within SPDE.

As it uses the index to maintain the data, I don't think it would be appropriate for bulk updates with lots of appends.

Splitting update transactions by "year" provides opportunity for parallel processing of the updates.
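
A rough sketch of that, assuming XCMD is allowed and per-year batch jobs exist (the job file names here are hypothetical):

systask command "sas update_2010.sas" taskname=yr2010;
systask command "sas update_2011.sas" taskname=yr2011;
waitfor _all_ yr2010 yr2011; /* block until both year-level loads finish */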

In use by analysts, "bigdata.sas7bdat" would effectively be "bigdata.sas7bvew", but they would refer to libname.bigdata without being aware that it is a view, not data. To achieve the best performance, you create it as an SQL view:

proc sql;
  create view lib.bigdata as
    select * from sp_yr1.bd_yr
    outer union corresponding
    select * from sp_yr2.bd_yr
    outer union corresponding
    select * from sp_yr3.bd_yr
  ;
quit;

ScottBass
Rhodochrosite | Level 12

Hi Peter,

Thanks for your reply(ies), much appreciated.

Let me restate what you said above to make sure I understand what you're saying...

1) Use the SPDE engine over the BASE engine for my scenario.

2) Partition my data by year.  For example (pseudocode and untested):

libname spde spde "some path";

data spde.mydata_2008 spde.mydata_2009 spde.mydata_2010 spde.mydata_2011;
  set libref.mydata;
  select(transaction_year);
    when(2008) output spde.mydata_2008;
    when(2009) output spde.mydata_2009;
    when(2010) output spde.mydata_2010;
    when(2011) output spde.mydata_2011;
    otherwise;  /* a SELECT needs an OTHERWISE, or an unmatched year stops the step */
  end;
run;

3) Index the separate SPDE datasets "appropriately" based on the typical queries the end users make.
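
For example (pseudocode and untested, with hypothetical index names):

proc datasets lib=spde nolist;
  modify mydata_2011;
  index create covno;                            /* hypothetical simple index */
  index create yearcov=(transaction_year covno); /* hypothetical composite index */
quit;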

4) If one of those indexes is unique, i.e. the concatenation of the keys uniquely identifies a record, then I can use this code to update/insert *:

proc append base=spde.mydata_2011 (uniquesave=rep) data=work.transaction_data_2011;
run;

Don't use this technique if transaction_data_2011 is "large", i.e. it's a "bulk" update.

5) By splitting the data by year, we can parallel process the updates, i.e. schedule multiple SAS transaction load jobs to run concurrently.

6) What my colleague (ok, my boss) suggested was to create separate data step views for ALL the permutations of the years.  For example (pseudocode):

v_mydata_all (all years)

v_mydata_2008_2009_2010

v_mydata_2008_2009

v_mydata_2009_2010_2011

v_mydata_2009_2010

v_mydata_2010_2011

etc. (I may have missed some combinations)

Then, the end user would have to select the correct view when doing their analysis (whether in EG, Futrix, or the Stored Process Web Application) to get their desired performance. Which IMO is not ideal.

Instead, are you suggesting that a single proc sql view defined as above (outer union corresponding) will get acceptable performance due to the underlying SPDE datasets and related indexes? (That would be great.)

Did you mean to use different librefs in your proc sql view, rather than different datasets?

Would it be "smart enough" to stop reading data when a where clause is filtering by year (e.g. transaction_year in (2009, 2010))? What about transaction_year in (2009, 2011), i.e. non-consecutive years (probably wouldn't happen, but...)?

Can you comment on why an SQL view will be more performant than a data step view?

Keep in mind that the main reason for this performance tuning/data restructure is to give better performance for interactive queries; we can live with current batch job performance, although that could change in the future as data volumes increase. I mention this to emphasize that the easier the architecture is for the end users, the better.

* As an aside, the proc append functionality reminds me of an Oracle "upsert" (update / insert).  It's an update technique which first tries an update, then an insert if the key was not found.  Google "Oracle upsert" for more details.
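
Something like this MODIFY-based sketch is what I have in mind (untested, and assumes a unique key called addresskey):

data spde.mydata_2011;
  modify spde.mydata_2011 work.transaction_data_2011;
  by addresskey;                          /* hypothetical unique key */
  if _iorc_ = %sysrc(_sok) then replace;  /* key found: update in place */
  else do;
    _error_ = 0;                          /* clear the harmless no-match condition */
    output;                               /* key not found: insert */
  end;
run;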

Please let me know if I've misunderstood what you've stated above.

Thanks,

Scott


Peter_C
Rhodochrosite | Level 12

Scott

1 yes (small is better than large and you can tune partsize to fit a whole partition into cache)

2 I would suggest separate libname definitions for each year.

  • It would allow you to get any benefit (like throughput capacity) available from concurrent reading from separate physical drives/spindles
  • it ensures a "bounded" capacity per folder (i.e. year) so administration is not coping with the ever-growing list within the folder
  • when coding updates or any other process for just one year, just one body of code is needed for each

3 yes (and find some way, like collecting and parsing altlogs, to identify the more- and less-popular indexes to speed things going forward; index usage is reported by option MSGLEVEL=I)

4 unfortunately UNIQUESAVE=REP no longer applies in SPDE (my memory is fading, but I thought it tested OK in SAS 9.0; tests of this fail in 9.2 and 9.3).

The UNIQUESAVE= option is a dataset option, and the REP value is now valid only in SPDS, not SPDE. Check whether the other values for UNIQUESAVE= might help. The manual for SPDE is good in pdf or html at http://support.sas.com/documentation/cdl/en/engspde/61887/HTML/default/viewer.htm#titlepage.htm

5 correct - parallel execution is a major reason to partition data

6 since it comes at little extra cost (a view takes little storage), you can test whether you need your colleague's suggestion. If the year index is chosen for where optimisation then there should be no difference, but as (I think) only one index will be used for where optimisation, the alternatives that your colleague suggested will enable other indexes to deliver significant sub-sets: for example, all of a particular product over a subset of years (assuming product is indexed)

With the principle that only one index is going to be used to extract the data, you can see some of the limitations. However, an SQL view will pass through index optimisation (a data step view is filled before external where clauses are applied, whereas SQL can optimise what it does with an SQL view; it doesn't know what a data step does), so it can pass through the "where year > 2009" kind of test: for the 2008 and 2009 tables, no partitions would be opened, only the indexes for those year-based tables.
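
To illustrate (names as in the view above):

options msglevel=i;

proc sql;
  create table recent as
  select *
  from lib.bigdata              /* the outer-union view */
  where transaction_year > 2009;
quit;

If the optimisation fires, the INFO: lines in the log should show which indexes were used per underlying year table.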

hope this helps

Peter

FriedEgg
SAS Employee

Clustering is not a feature available in the SPD Engine; a lot of the indexing features in SPDS are also not available in SPDE, the specifics of which escape me at the moment.

