03-14-2015 01:39 PM
Need some technical help for SPDE experiments. In one proc SQL (with star schema) with BASE SAS, it takes 10 minutes to complete after applying all SGIO settings (bufno, bufsize etc). When I try to run same query but on SPDE lib, it takes same time as earlier. But while defining SPDE library, I didn't specify more than 1 disk but does it really required? Or SPDE useful only with DATA step and not with proc step (star schema)? Please suggest if I should try repeating same experiment but with more disks defined for SPDE library or should I try DATA step?
Since I am a new bee so very tough for me to try DATA step .
03-16-2015 02:56 AM
SPDE only makes sense with separate PHYSICAL drives for the "buckets". If you define several directories on the same hardware(disk) and use them for a SPDE library, the single disk still has to process all the I/O, and the I/O bottleneck is not resolved.
Setting up a solid SPDE solution needs to take the number of available disks (or RAID sets) and the number of processor cores into account.
Valid setups look like
2 cores, 4 disks (1 core handles 2 disks during execution)
4 cores, 4 disks
x cores, x*n disks (n starting at 1 and ending when you're no longer I/O bound - no waitstates detectable)
If your SAS server/machine only has one disk, forget SPDE.
03-16-2015 03:25 AM
Oh, and to answer your original question:
SPDE libraries are transparent to SAS code. They can be used anywhere you use an "ordinary" library.
03-16-2015 03:52 AM
Thanks Kurt. In one article I saw example where different disks were provided for SPDE indexes & data. In fact, it should have different disk only for data, right? Because major improvement will come when data is equally distributed along multiple disks.
03-16-2015 04:03 AM
In my opinion there are some specific cases, when SPDE engine can be useful even when you have 1 disk:
- Automatic sorting capability of SPDE (It is rather for convenience, since you can any time run a proc sort or create an index on a base SAS dataset, then you can use the by statement)
- SPDE is able to optimize a query like this: WHERE col1='va1' or col2='val2' ; - if there is an index on col1 and col2.
03-16-2015 05:07 AM
The automatic sorting of SPDE is actually just a "hoax". In reality, SAS does a very normal sort by itself when you access a SPDE dataset with by ..... And it does that in the standard UTILLOC, and if that is not multiple disk, the sort is not much faster than from a normal dataset.
03-17-2015 02:17 AM
Thanks Kurt & Gergely. In my second experiment, data is distributed across 3 disks and same proc sql (start join) is executed but still performance is not improved.
In my case, IO doesn't seem to be bottleneck because whenever I change SGIO settings, IO changes to higher or lower values with execution time remaining same and same confirmed from perfmon capture results.
In my server case, CPU utilization is seem to be problem. It has 12 cores but it uses only one core steadily during 10 minutes of execution and only once in mid of execution, other cores are used at which sort process runs.
My understanding was that SPDE allows multithreading at core levels but that's not true. After reading your reply carefully, it sound like single IO fetches data from single disk or multiple disk so SPDE is meant to utilize that single IO to read data from all the disks at a time. Thats why SPDE was built to divide data and club together after reading data. But now a days, hardware are of superior at cheap so may be IO problem resolved than it use to be earlier.
03-17-2015 02:27 AM
After looking at process monitor trace, single thread ID is used to read data from all 3 disks and it's red in time slicing e.g. 01:00:00 to 01:00:30 thread reads data from C drive then 01:00:30 to 01:01:00 thread reads data from D drive then 01:01:00 to 01:01:30 thread reads data from E drive so it's a sequential read by that thread from all 3 drives using single core out of 12 available cores. Or may be I am wrong and it's issue with process monitor because it may not be able to show two operations at same time by same thread on different files. Not sure... but no improvement in execution time and remaining cores unused from perfmon data.
03-17-2015 02:42 AM
Which SAS version are you using?
Also take a look at the
options in proc options.
Maybe you need to set one of those.
03-17-2015 02:45 AM
Since I'm a UNIX guy at heart, I would also suspect that your Windows machine has some limitations on handling parallel I/O. Typical PC's just don't play at the level of pSeries and the like.