BookmarkSubscribeRSS Feed
deleted_user
Not applicable
Hello all.

First of all, thank you in advance for *any* feedback you can provide on this matter.

I am a business user in a very large bank here in Mexico. I have had built (over 3 years of effort) a technological infrastructure for data mining, modeling and analytics through the use of SAS/Server. The specifications are as follows:

- Unix Server (Sun 10,000) with 6 processors and 12 GB of RAM
- SAS/Server with:
Integration Technologies
SAS BI
IML
Graph
Access/DB2
Access/Oracle
- Access to the server via Enterprise Guide 4.x with 3ms ping to the server
- Files either in DB2 or in flat files
- Around 20 concurrent users

Since it started, all users felt performance was terrible. However, we had not dimensioned how awful it was. So I decided to run a benchmark to compare against SAS/PC.

I took a fairly complex process involving merging, lots of TRANSPOSE, sorting, complicated data steps, macro processing, lots of temporary datasets created, etc. I ran it on the server for two different sets of input datasets - one that is significantly smaller than the average and one that is significantly bigger. Input files for the procedure were all flat SAS files in a library in the server.

I ran the same code in my laptop (Intel Centrino Duo 2.4Ghz, 1 GB RAM, Windows XP, SAS/PC with STAT, ASSIST and SHARE), for both sets of datasets, using flat TXT files and just modified the code to make the import from the TXT file instead of subselecting fields from the SAS dataset.

The results are as follows:


SAS/Server: 61 minutes for the small set, 300 minutes for the large set.
SAS/PC: 2 minutes for the small set, 36 minutes for the large set.


Input files in the "small" set are 17.4 MB, 1.8 MB, 51.0 MB and 8.0 MB. Input files in the "large" set are 94.4 MB, 7.3 MB, 224.2 MB and 34.7 MB.

As additional reference, I ran the same code on my PC at home (Athlon 64 X2 6000+, 4 GB RAM, Vista Ultimate) and the "large" set completed in 9 minutes.

Is there anything I am missing? Without going into detail with configuration issues, optimization constraints, etc., it seems like a heck of a difference. At first glance (2 years ago) I would had sworn codes would actually run *faster* in the server than in my PC. Today I would at least expect the code to run at the same speed. To me, the specs of the server are overkill and codes should fly.

My IT department does not know SAS - every issue is forwarded to SAS Support. Obviously a support optimization has not been run, but I would disagree that this is needed in order to speed up things - such slowness just doesn't make sense, I am sure there is something fundamentally wrong with configuration or otherwise. They have attributed low performance to the way the server is configured and due to the fact some productive applications are also running on that same server. They will apparently be moving soon to a three-layer architecture with a bigger server dedicated exclusively to SAS. However, I have the funny feeling performance will not improve significatively.

Does anyone have any comments, experiences, suggestions or other ideas around this topic? I would really like to have input of other users working on a similar environment. I am very worried since most of our processes that we will be converting to SAS (currently in FoxPro) will run over datasets much bigger than these (think 1-2 GB, millions of records, 100-400 fields). Is performance really going to be *this* bad?

Thank you very much in advance. Message was edited by: biyectivo
41 REPLIES 41
deleted_user
Not applicable
1) At a previous place, we/I had set up an IBM p630 with 4x 1.45 GHz processes and 32 GB of memory for our SAS server. On that box, we would run through data files > 1 GB in minutes. One of my processes whipped through 600M observations in 30 minutes.

SAS is IO intensive.
SAS began on the Mainframe where most data was held on tape, so data was sequentially read in, processed, and then written out.

Disks improved the readily availablility of storage resources, but the processing methodology hasn't changed.

To maximize the IO performance of SAS, here is what I did.

1) Each SAS "group" -- research, data analysis, capacity planning, operations, etc. -- had their own separate 100 GB filesystem for WORK.
2) Each SAS "group" had their own 100 GB filesystem for the SORT volume/device.
3) Each SAS "group" had their own 100 GB to 200 GB of space for their own "proprietary" permanent SAS datasets.

4) The storage was on an Hitachi 99xx storage frame -- today, if new, we would be using a USPV.
5) The 100 GB filesystems were composed of 2x 50 GB LUN's where were striped together at the OS Volume Manager layer.
6) HDLM was then used to load balance the IO across 2 paths.
7) The IO paths were direct connected to the Hitachi frame, not through a SAN, insuring no IO collections (fan in) a the front-side-adapter (FSA) port.

Performance improvements would come from
1) Each group gets their own IO paths
2) Increase the filesystems to use 4x striped LUNs and then use 4x paths per filesystem.

Software tuning of SAS included BUFNO=3, Blksize=64k, asynchio, threads.

Using 64k for the block size maximizes the potential for good IO performance. small IO's can kill an FSA port before the bandwidth of the channel is consumed.

Proper tuning of the OS IO and LUN striping maximizes the opportunity of maximum IO performance for SAS.

SAS is a well written application that can consume all of any servers resources, if the system and SAS are properly tuned.
deleted_user
Not applicable
Thank you very much Chuck. I will surely forward these comments to my IT department so that they can optimize the performance.

Any other comments are deeply appreciated.
1162
Calcite | Level 5
I don't know a lot of technical details, but I found that we had performance issues on our server because of where the files were stored and processed. This relates very much to the previous comments about IO usage.

Originally, our IT Dept. set up our SAS server with very little disk space and told us to use the mapped network drives. Unfortunately, these mappings were so bad (again, I can't explain the technical details) that connections were dropped in the middle of reading or writing files.

When I ran the same procedures on my PC, they ran much faster. I'm sure this was in part because the files were stored locally and SAS wasn't sending data over a network.

I would check whether your server is reading and writing data over a network. Anything you can move onto that physical server would improve performance.
prholland
Fluorite | Level 6
You didn't say whether the times you quoted were elapsed or CPU times. Given the differences I would suspect that they were elapsed times, for the following reasons:

* Your PC has 2 processors running a single job, whereas your server has 6 processors running 20 concurrent jobs.
* Your PC has 1Gb memory used by 1 job, whereas your server has 12Gb shared between 20 concurrent jobs.
* Your PC has locally attached disk space, whereas your server may have networked disk space.
* Finally, if you are measuring the time from submission to receipt of the final output, particularly if the outputs are large, then EG (and any other network-connected client) can take a long(!) time to receive large outputs through the network connection, even though the processing ended much earlier.

If you were to measure the CPU time used on the different platforms (using OPTIONS FULLSTIMER) you may get a different result.

...........Phil
Doc_Duke
Rhodochrosite | Level 12
biyectivo,

In addition to the other posts, I would add that SAS has a Performance Consulting Group who can help you make the best of what you've got or help with ideas on a purchasing decision. They were very helpful to us, both for the diagnosis of the problems on our existing system (and tuning to improve performance) and in providing us the objective information to take to management to get new systems in place.

Doc
prholland
Fluorite | Level 6
> biyectivo,
>
> In addition to the other posts, I would add that SAS
> has a Performance Consulting Group who can help you
> make the best of what you've got or help with ideas
> on a purchasing decision. They were very helpful to
> us, both for the diagnosis of the problems on our
> existing system (and tuning to improve performance)
> and in providing us the objective information to take
> to management to get new systems in place.
>
> Doc

While I'm sure the Performance Consulting Group at SAS was very helpful, I would question whether they could be completely objective. It would be difficult for them to justify to their management any recommendation for the cheapest solution, if it was only marginally less effective than the most efficient solution.

I would recommend contacting the Performance Consulting Group at SAS, but also involving an independent consultant to advise on interpretation of any conclusions given.

........Phil Holland
deleted_user
Not applicable
Thank you all for the replies. I am quoting elapsed times (i.e. end-user perception). I will keep you posted.
deleted_user
Not applicable
Just a quick update. IT migrated to a three-layer architecture, basically dedicating a new server to the SAS product, another for the database server and another for web server and processes. Performance has certainly improved a lot. However I forwarded all the tips in this forum to my IT department and hope they incorporate these into the program.

One question - there is the SORTDEV option which lets you specify a directory for temporary sorting files. Is it advisable to have it in a different directory than WORK? I think the default value is null (which then uses WORK to create these files). Is this correct?

Regards,
deleted_user
Not applicable
Yes.

If you want to maximize performance and are working with large tables (GB files), then SORTDEV should be a separate filesystem from work, using separate LUNs.

I just made a request specifically for this for my own SAS server:

2x 50 GB LUNs striped at 64k per LUN to form a 100 GB filesystem mounted to sas_work
2x 50 GB LUNs striped at 64k per LUN to form a 100 GB filesystem mounted to sas_sort

2x striped LUNs because I have 2 IO paths. There should be one LUN per IO path in a striped set to maximize the potential for maximum throughput. This is in numbers, not assignment. Each LUN should be presented across each path (FSA port) in an active/active load balanced/sharing arrangement/configuration.

Even though SAS supports asynchronous IO, the IO is still mostly sequential, so the striping helps to spread the load across the channels and IO and device drivers.

Also, BUFNO= should be set to at least #paths, and in my case, BUFSIZE=64k. I have not had an opportunity to study if extra buffers help, providing the opportunity to read ahead and write behind. MVS can do this with files with specific JCL commands -- be reading in a block, have a block available for processing, and be writing a block essentially concurrently. When appropriately balanced, processing moves from block to block and IO wait for something to work on is minimized. For SAS and Windows or Unix, I don't know about this one.

The key is separate filesystems using separate LUNs.

You want to avoid storage management providing a single 1 TB LUN because it can, in order to conserve on world-wide names, # of supports LUNs, etc. and then Unix support partitioning this single LUN into 4 partitions, and then striping the pairs of partitions to form a filesystem. You also want to avoid thinking that a separate directory = separate filesystem. NOT guaranteed to be THE SAME.

One of the joys (really and truly, not facetious) of using separate filesystems for these things is that maintenance of them is simple, easy and quick since they only hold transient data. They don't have to backed up! and shouldn't be backed up! Some backup software can differentiate a filesystem, but not a directory for backup. Since these are filesystems, and if the underlying storage is enterprise class boxes (Hitachi 99xx, USP or EMC Symmetrix, DMX ) and if the files are typically multi-gigabyte, then if the server has only a few GB of memory, you can configure the filesystems to use direct IO, by-passing local cache, relying on the external cache of the storage boxes, since there is so much data you'll blow through the local cache anyway (breathe). This will actually increase throughput performance for processing large datasets.

Always remember, with computer systems, sweat the small stuff, because the little details can kill you.
deleted_user
Not applicable
Thank you very much Chuck - this is great. Hope my IT department can implement these best practices.
Kenny
Obsidian | Level 7
We had similar problems. It was due to disk sharing through SAN. We noticed very poor performance and it was traced to some database server doing a backup - completely unrelated to our SAS application and disk setup - or so we thought!

We switched to dedicated clarion disk and we saw immediate improvement in I/O, very massive improvement! To top it off, we had our admin set up a RAM disk (we have 64GB of RAM on an 8 way Linux box.) we placed our most used SAS tables on the RAM disk and performance was nothing short of miraculous.

It's all about proper I/O setup as Chuck has laid out in such great detail!
deleted_user
Not applicable
I am quite concerned about how SAS is deployed in our environment which includes both unix and wintel. I've been wanting to find a set of best practises how SAS can be deployed, and what infrastructure would be suitable, but Chuck's remarks are very helpful. Is it typically fast for EG to run using an external (or copied to local from network location) flat file or connecting to a relational database? Which would cause more network traffic?

-Josh
deleted_user
Not applicable
Flat file vs. Database -- there is no easy answer.

1) This is somewhat politically charged with SAS, as SAS and its datasets and various engines can be used as/for (and technically is) a database.

2) Most relational type database management systems -- Oracle, DB2, etc. -- are more than just an RDMS, but also are transactional engines, providing ACID properties, which include abilities like being able to roll back a transaction. Base SAS, by itself, without explicit user programming, does not do this.

SAS does not necessarily support sharability of data easily either. This is simply evidenced by the fact that yesterday, my production loading of a permanent SAS dataset failed because someone had accessed that file with EG the day before and had left for the night without shutting down their EG.

While SAS does support multiple threads, per se, it is not inherently multi-threaded, as it's heritage is in extremely fast sequential processing on a mainframe.

In a distributed environment of SAS servers and libraries or files and data, it is generally fastest to run initial selects and summarizations on the server that holds the data. What is reported back to EG is always a simple limited subset of the results.

If I were to design a distributed SAS system from scratch -- which I have -- each user gets EG, SAS itself would live on an AIX box, and in the Mainframe sysplex if there is one. I would not necessarily have SAS solely on a Mainframe (topic for a different time/thread). Since some features of SAS work best/only on a Windows box, I would have at least one Windows server. Power users and developers would get PC SAS on their workstation with extra memory and at least one extra hard drive.

So far, it has been my observation that EG is a great ad hoc analysis and reporting tool, but has many shortcomings for regularly sheduled batch processing. That's why you need SAS developers for creating those things.

My limited experience to date with SAS's MetaData Server is not a happy one. But, a properly administrated meta-data repository for EG is essential and very helpful. This has to be on a separate server, and I believe it still has to be a Windows box (ugh).

The only things possibly faster than SAS at reading a flat file may be a well written compiled COBOL or C program, maybe. But, it'll take a programmer longer to write these COBOL or C programs than it will a SAS programmer, and SAS is superiorly flexible at it.

I am a firm believer that data should be placed into an RDMS to provide ready access to many different application programs. Only SAS can read SAS datasets, as far as I know. RDMS's provide better security controls, transactional process controls, etc. While much of this can be done with/in SAS, it is a lot more work because the core competency of SAS is data analysis of large datasets and simplifying the execution of highly sophisticated statistical analyses.

All kinds of things can be done RDMS systems to improve their performance from many aspects. Modern RDMS's are highly threaded and will outperform SAS on initial queries/views of complicated data structure tables. Indexing is easier to manage and maintain. Data integrity is easier to mainain. All sorts of things.

But, after I have pulled the data I want from the database, I want it on the box that is running SAS, and no DMBS nor its language is going to allow me to process that data faster for the results I need, and the programming, If I had tried, would have been a lot harder, or in some cases impossible (Oracle's PL/SQL just can't do some things that SAS can -- but, it wasn't supposed to in the first place).

SAS can be used to load a database -- using SAS for ETL.
Some prefer to use a special purpose ETL tool like Informatica.
I tended to use Oracle's SQLLoader.
This issue is non-trivial, and no matter how things are sliced and diced, proper ETL is a messy complicated thankless yuck job, especially when the data/process blows up -- who can think of all the Murphyisms? and then there's the issue of the human process(es) involved in resolving them.
If you are lucky, all business transaction processing is already in an RDMS and you can just simply access it.

Minimally, the EG metadata repository will define users, put the users into groups, and define a logical SAS server for each group. These logical group servers will have access to only that group's data (generally their private PSDS's and libaries). Then there will be a logical SAS server that has publicly available libaries assigned to it.

For example:
Mainframe = SAS_Main and has libraries defined for SMF records.
SAS_Public = on AIX box and contains RDMS library defintions for the Data Warehouse.
Marketing = on AIX box, home = /home/marketing. LDAP tools allow only employees of the Marketing department to use Marketing and to access /home/marketing on the box when they log in. /home/marketing is a 200 GB filesystem. It contains two other independent filesystems /home/marketing/sas_work and /home/marketing/sas_sort, each are 100 GB.
Research = on AIX box, home = /home/research. ditto Marketing
Operations = on AIX box, home= /home/operations. ditto Marketing
Business Analytics = BusAnals = on Windows box, home = \\sasprd\Business Analytics. Still ditto Marketing, with the difference that Business Analytics is a share on the box, the real base directory is F:\sas_users\Business Analytics. Business Analytics is on a Windows server because they can't let go of their use of Excel spreadsheets.

Each of these logical servers has its base access directory as the home directory, not the system directory, for security reasons.

The next rule, is that there is a standing rule, that the first item in each project is an "Assign Library" creating EGTASK that points to a directory (folder) specifically and only for that project, to hold the project file, and all the generated datasets. This provides for significantly easier disk space maintenance.

Sorry for the small book, but I hope this helps. Message was edited by: Chuck
deleted_user
Not applicable
I infer then that it is faster to run with a flat file (located near the sas server) than to do so with the data accessed via odbc, though an oracle database might be able to manage the data more efficiently. But more so, it seems that a flat file is more portable (in a multi-user environment where alot of users are doing their own thing but running off the same flat file). Thats my biggest worry, how to reduce the IO load when users cannot afford to wait for each other to run their own program, but most of them need the same flat files. I was reading Chris Hemedinger's SAS for dummies, and it seems to suggest that EG runs faster with pre-defined data libraries instead of calling a physical file at the EG. But what it does not explicitly say is whether physical files called at EG (whether it exists on a local disk or a network directory) is whether the physical data has to travel from its src to the sas server, or whether the physical data is first routed to the EG before being piped to the server tier.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 41 replies
  • 10021 views
  • 1 like
  • 6 in conversation