BookmarkSubscribeRSS Feed
gauravkodmalwar
Calcite | Level 5

Dear all,

Need some technical help for SPDE experiments. In one proc SQL (with star schema) with BASE SAS, it takes 10 minutes to complete after applying all SGIO settings (bufno, bufsize etc). When I try to run same query but on SPDE lib, it takes same time as earlier. But while defining SPDE library, I didn't specify more than 1 disk but does it really required? Or SPDE useful only with DATA step and not with proc step (star schema)? Please suggest if I should try repeating same experiment but with more disks defined for SPDE library or should I try DATA step?

Since I am a new bee so very tough for me to try DATA step Smiley Happy.

Thanks,

Gaurav

9 REPLIES 9
Kurt_Bremser
Super User

SPDE only makes sense with separate PHYSICAL drives for the "buckets". If you define several directories on the same hardware(disk) and use them for a SPDE library, the single disk still has to process all the I/O, and the I/O bottleneck is not resolved.

Setting up a solid SPDE solution needs to take the number of available disks (or RAID sets) and the number of processor cores into account.

Valid setups look like

2 cores, 4 disks (1 core handles 2 disks during execution)

4 cores, 4 disks

....

generic:

x cores, x*n disks (n starting at 1 and ending when you're no longer I/O bound - no waitstates detectable)

If your SAS server/machine only has one disk, forget SPDE.

gauravkodmalwar
Calcite | Level 5

Thanks Kurt. In one article I saw example where different disks were provided for SPDE indexes & data. In fact, it should have different disk only for data, right? Because major improvement will come when data is equally distributed along multiple disks.

gergely_batho
SAS Employee

In my opinion there are some specific cases, when SPDE engine can be useful even when you have 1 disk:

- Automatic sorting capability of SPDE  (It is rather for convenience, since you can any time run a proc sort or create an index on a base SAS dataset, then you can use the by statement)

- SPDE is able to optimize a query like this: WHERE col1='va1' or col2='val2' ;   - if there is an index on col1 and col2.

link:

When Should You Use the SPDE Engine

Kurt_Bremser
Super User

The automatic sorting of SPDE is actually just a "hoax". In reality, SAS does a very normal sort by itself when you access a SPDE dataset with by ..... And it does that in the standard UTILLOC, and if that is not multiple disk, the sort is not much faster than from a normal dataset.

gauravkodmalwar
Calcite | Level 5

Thanks Kurt & Gergely. In my second experiment, data is distributed across 3 disks and same proc sql (start join) is executed but still performance is not improved.

In my case, IO doesn't seem to be bottleneck because whenever I change SGIO settings, IO changes to higher or lower values with execution time remaining same and same confirmed from perfmon capture results.

In my server case, CPU utilization is seem to be problem. It has 12 cores but it uses only one core steadily during 10 minutes of execution and only once in mid of execution, other cores are used at which sort process runs.

My understanding was that SPDE allows multithreading at core levels but that's not true. After reading your reply carefully, it sound like single IO fetches data from single disk or multiple disk so SPDE is meant to utilize that single IO to read data from all the disks at a time. Thats why SPDE was built to divide data and club together after reading data. But now a days, hardware are of superior at cheap so may be IO problem resolved than it use to be earlier.

gauravkodmalwar
Calcite | Level 5

After looking at process monitor trace, single thread ID is used to read data from all 3 disks and it's red in time slicing e.g. 01:00:00 to 01:00:30 thread reads data from C drive then 01:00:30 to 01:01:00 thread reads data from D drive then 01:01:00 to 01:01:30 thread reads data from E drive so it's a sequential read by that thread from all 3 drives using single core out of 12 available cores. Or may be I am wrong and it's issue with process monitor because it may not be able to show two operations at same time by same thread on different files. Not sure... but no improvement in execution time and remaining cores unused from perfmon data.

Kurt_Bremser
Super User

Since I'm a UNIX guy at heart, I would also suspect that your Windows machine has some limitations on handling parallel I/O. Typical PC's just don't play at the level of pSeries and the like.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1340 views
  • 0 likes
  • 3 in conversation