06-23-2014 04:28 PM
I'm running SAS 9.3 on a 64-bit Windows 7 machine with 24 GB of RAM and 2 x Intel Xeon 2.4 GHz processors. My jobs seem to take forever. I checked RAM and CPU usage in the Windows Task Manager, and SAS is using approximately 75 MB of RAM and only 13% of the CPU. Note that I changed the MEMSIZE option in the config file from 2 GB to 18 GB and also raised SORTSIZE to 13.5 GB. Why is the RAM/CPU usage so low? Are there tweaks I can make to improve performance? Adjust the buffer size?
I've attached the output from running proc options.
Thanks very much for any suggestions.
06-23-2014 06:04 PM
SAS performance isn't only about CPU and memory usage. Check your SAS logs. How long is forever? Which steps are taking the longest? If it's DATA steps and PROC SORTs, then I/O is most likely the culprit.
Run your code with SAS option FULLSTIMER to check memory and IO usage.
Focus on the steps taking longest - can they be re-coded to improve performance?
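The FULLSTIMER advice above can be sketched as follows; this is a minimal example (the DATA step shown is just a stand-in for whatever step you are timing):

```sas
/* turn on detailed resource reporting in the log */
options fullstimer;

/* every subsequent step now logs real time, user/system CPU time,
   and memory usage, so the slowest steps are easy to spot */
data work.copy2013;
    set data_exp.policysummarystats2013_;
run;
```

Compare the "real time" against the "user cpu time" + "system cpu time" in the log: a large gap between them usually means the step spent most of its elapsed time waiting on I/O.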
06-23-2014 07:57 PM
Thanks for your thoughts. I ran a simple DATA step that appended several files into one. The data files were large (the combined file totalled 204 GB), but the job took 75 minutes. I could see perhaps 30 minutes for something like this, but over an hour seems too long. Note the hard disk is a Hitachi 7200 RPM drive. If the issue is I/O, how do I fix it? Adjust the BLKSIZE?
data data_exp.policysummarystatsraw;
    set data_exp.policysummarystats2008_ data_exp.policysummarystats2009_ data_exp.policysummarystats2010_
        data_exp.policysummarystats2011_ data_exp.policysummarystats2012_ data_exp.policysummarystats2013_
        data_exp.policysummarystats2014_;
run;
NOTE: There were 30051 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2008_.
NOTE: There were 25949456 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2009_.
NOTE: There were 29754796 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2010_.
NOTE: There were 30672849 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2011_.
NOTE: There were 32867276 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2012_.
NOTE: There were 19762534 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2013_.
NOTE: There were 1535814 observations read from the data set DATA_EXP.POLICYSUMMARYSTATS2014_.
NOTE: The data set DATA_EXP.POLICYSUMMARYSTATSRAW has 140572776 observations and 172 variables.
NOTE: DATA statement used (Total process time):
real time 1:15:51.35
user cpu time 2:43.30
system cpu time 8:36.80
OS Memory 8632.00k
Timestamp 06/23/2014 04:27:35 PM
06-23-2014 08:42 PM
A couple of things to try:
On your DATA statement - data data_exp.policysummarystatsraw (compress = binary);
If compressing the output dataset speeds things up then try compressing your yearly input datasets as well.
If the yearly datasets contain the same variable names try to APPEND the tables like so:
proc datasets library = data_exp;
    append base = policysummarystatsraw
           data = policysummarystats2008_;
run;
quit;
You can only append one dataset at a time but overall it may still be quicker.
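To append all of the yearly tables, the snippet above can be wrapped in a macro loop. This is a sketch, assuming the datasets follow the policysummarystatsYYYY_ naming pattern shown in the log:

```sas
%macro append_years(first=2008, last=2014);
    %local yr;
    %do yr = &first %to &last;
        /* NOLIST suppresses the directory listing in the log */
        proc datasets library = data_exp nolist;
            append base = policysummarystatsraw
                   data = policysummarystats&yr._;
        run;
        quit;
    %end;
%mend append_years;

%append_years()
```

The `&yr._` form is needed because the trailing underscore is part of the dataset name: the dot terminates the macro variable reference before the literal `_`.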
06-23-2014 10:10 PM
Thanks for the suggestion. I created a macro using the code you posted. I tried it on a sample of 1000 obs and it seems to execute quickly. I'm running it now on all the data and will post the results.
06-24-2014 02:31 AM
You say "the" hard disk, implying you run everything on one disk.
Now to The Math:
204GB filesize means that 204 GB have to be read _and_ written, giving overall total I/O of 408 GB.
408*1024 = 417,792 MB
417,792 / 75 / 60 = 92.84 MB/sec, which is not bad for a single 7200 RPM disk. You are simply massively I/O bound. Especially since you read and write concurrently on the same disk, which further reduces its I/O throughput.
Don't be concerned about the low memory usage; for simple read/write operations SAS does not need more.
What can you do?
- use compression, as suggested. This will increase CPU load (of which you have plenty available) and massively reduce I/O, especially with datasets that contain empty strings, etc.
- get some REAL storage HW: use 2 or more disks with striping; this will reduce I/O time roughly in proportion to the number of disks. Take caution: the failure of one disk will destroy all data! To be on the safe side, set up 2 stripe sets and mirror them with RAID 1.
06-25-2014 06:30 PM
Thanks very much for the analysis. The box that I'm running SAS on has 2 HDDs (both 7200 RPM). I'll create a temp sas data folder on the other drive to write large data sets. Hopefully, this will increase performance. I would just purchase an SSD, but my firm plans on upgrading to a server soon.
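Splitting reads and writes across the two drives just takes a LIBNAME pointing at the second disk; the drive letter and path here are placeholders, not from the thread:

```sas
/* D:\ is assumed to be the second physical disk */
libname bigtemp "D:\sastemp";

/* read from one spindle, write to the other */
data bigtemp.policysummarystatsraw (compress = binary);
    set data_exp.policysummarystats2008_;  /* remaining years omitted */
run;
```

Because the input library and the output library sit on different spindles, the drives no longer compete for the same read/write head.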
06-25-2014 06:59 PM
I recommend setting COMPRESS = BINARY in an OPTIONS statement. That way all of your WORK datasets and utility files get compressed too and you don't have to keep repeating COMPRESS = on all required steps.
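In practice that is a single line near the top of the program (or in the config file):

```sas
/* compress every dataset created from here on,
   including WORK datasets and utility files */
options compress = binary;
```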
06-26-2014 02:02 AM
I once had the pleasure of attending a presentation at SEUGI in Florence where the ideal setup of disk storage for SAS was the main theme, and I took the following with me:
If you have a SAN, it is the responsibility of the SAN manager to provide you with enough throughput; can't do much yourself there.
If you work with local disk storage, consider the following:
Is an outage in the DWH service caused by a failed disk tolerable? If not, always use some kind of mirroring, RAID 1 preferred (the fastest method for writing), starting with 2 disk adapters across which you place the mirrors.
Try to saturate the adapter's maximum throughput with disks. From the old SCSI days: if the adapter supports 320 MB/sec and each disk delivers 40 MB/sec, use 8 disks per adapter; set up striping so that all disks work in parallel and every access engages all disks in a logical group.
Set up several physically independent volumes and assign those as "special work" libraries. This allows you to do MERGEs where SAS reads infile 1 from disk (group) A, infile 2 from disk (group) B, and writes the outfile to disk (group) C. You'll be surprised what you can gain with that.
Set the UTILLOC so that it points to a different disk (group) than WORK. This prevents disk contention when doing a sort.
Most of this applies to magnetic storage; SSDs avoid many of the problems caused by latency and seek times, alleviating the effects of disk contention.
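Assuming three independent volumes mounted as D:, E: and F: (placeholder paths, and a hypothetical key variable), the multi-volume MERGE described above might look like this:

```sas
/* one library per physical volume */
libname diska "D:\sasdata";
libname diskb "E:\sasdata";
libname diskc "F:\sasdata";

/* read the two inputs from different disks, write to a third,
   so no single spindle handles more than one data stream */
data diskc.combined;
    merge diska.infile1 diskb.infile2;
    by policy_id;   /* hypothetical merge key */
run;
```

The WORK/UTILLOC split is set at SAS invocation, e.g. `-work "E:\saswork" -utilloc "F:\sasutil"` in the config file (paths again placeholders).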
06-27-2014 02:02 PM
Thanks very much for the information. I believe that the I/O environment on our new server will be a SAN. Therefore, we'll just have to be sure that we get enough throughput. How much throughput would you recommend for a small shop of SAS users (< 6) working with data sets in the millions of records (an uncompressed large file would be ~200 GB)?
06-30-2014 02:55 AM
In the past, I have tried to achieve a consistent data rate of 100 MB/s writing on the SAN, but our workspaces are located on internal disks.
Given that your individual files are much larger than ours, and your SAN may also be required to handle the WORK libraries, I'd go for 500 MB/s. If somebody asks, just tell them that data warehousing IS an I/O intensive task; where DB systems need to be quick at handling small data chunks randomly, DWHs usually go through _all_ the data in a more or less sequential way.
07-01-2014 09:23 PM
Thank you for your thoughts! At least now I'll know what to request. I'll definitely shoot for 500 MB/s. We'll see what IT says.