AhmedAl_Attar
Rhodochrosite | Level 12

FG,

Based on your reply, it sounds like network bandwidth is your bottleneck! Because you are using a NAS to host your data set, every time you read it you are putting your SAS process at the mercy of the network, hoping there is adequate bandwidth left to transfer the data down to your desktop so SAS can read it!

Regardless of whether you add indexes, I'm not sure how much difference it will make! Unless someone from your company's network infrastructure team increases (if that's even possible) the size of the network pipe between your desktop and the storage disk, you will not see any significant performance increase!

In my humble opinion, your best short-term solution right now would be:

1. Use multiple directly attached disks (the more spindles, the better)

2. Utilize SPDE and spread the Data Paths across the multiple disks (avoids I/O contentions)

3. Use proper -memsize -sortsize SAS startup options settings
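For item 2, a minimal SPDE sketch might look like the following. The drive letters and paths are placeholders for illustration only; point them at separate physical spindles on your own machine:

```sas
/* SPDE library with data spread across three directly attached disks.
   Paths below are hypothetical -- use one per physical spindle. */
libname bigdata spde 'd:\sasmeta'                            /* metadata   */
        datapath=('e:\sasdata1' 'f:\sasdata2' 'g:\sasdata3') /* partitions */
        indexpath=('h:\sasidx');                             /* index files */

/* Copy an existing Base-engine table into the SPDE library */
proc copy in=work out=bigdata;
   select bigtable;
run;
```

SPDE round-robins partitions across the DATAPATH locations, which is what spreads the I/O and avoids the contention mentioned in item 2.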

long-term:

Move to Server SAS Install and SAN 🙂

Good luck,

Ahmed

FredGIII
Quartz | Level 8

Ahmed, Chris, et al.:

Thanks for all your feedback. I feel a bit foolish. During my research it appeared to me that hash tables would be best, and I spent a lot of time trying to figure out that route. What I had not tried was simply creating an index. I finally did that yesterday. Creating the index took about 12 hours (long, but at least not 3 days!). After the index was created, I was amazed at how fast subsetting and summarizing went. A trial of PROC SUMMARY for one serial number by day took only about 4 minutes! So now I have a macro to loop through the serial numbers, and I have turned it loose 🙂. Lesson learned!
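A rough sketch of that kind of index-driven loop (table, library, and variable names here are illustrative, not the actual code): with an index on the serial-number variable, each WHERE= subset is satisfied by the index rather than a full table scan.

```sas
/* Illustrative only: summarize each serial number by day,
   letting the index on serial satisfy the WHERE clause. */
%macro summarize_all;
   /* Build the list of serial numbers to loop over */
   proc sql noprint;
      select distinct serial into :serials separated by ' '
      from big.sensor_data;
   quit;
   %local i sn;
   %do i = 1 %to %sysfunc(countw(&serials));
      %let sn = %scan(&serials, &i);
      proc summary data=big.sensor_data(where=(serial="&sn")) nway;
         class date;
         var reading;
         output out=sum_&sn (drop=_type_ _freq_) mean=;
      run;
   %end;
%mend summarize_all;
```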

stlimpbizkit
Calcite | Level 5

Thanks Fred for sharing. I am interested in this topic as it applies to what I am working on as well.

Do you mind clarifying the index method that helped you shorten the run time so significantly: is it simply a unique identifier for the purpose of referring to certain rows/records? What was the old method (e.g., multiple selection criteria)?

Just trying to understand what you did differently pre vs. post. Many thanks.

ballardw
Super User

When you create a new data set you can specify one or more variables to index on.

With a data step, if I create a data set and want to index on an SSN, it would look like:

data new(index=(ssn));

Or a composite index using more than one variable:

data new (index=(cityst = (city state)));

Refer to a composite index using the index name: cityst

Or you can add an index to an existing data set using Proc Datasets and the INDEX CREATE statement as part of a MODIFY RUN group.
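A minimal example of that MODIFY run group (library, data set, and variable names assumed for illustration):

```sas
/* Add a simple index on ssn and a composite index cityst
   to an existing data set, without rebuilding it. */
proc datasets library=mylib nolist;
   modify new;
      index create ssn;
      index create cityst = (city state);
   run;
quit;
```

This avoids rewriting the table, which matters when a rebuild of a multi-terabyte data set would itself take days.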

FredGIII
Quartz | Level 8

Oh by the way Ahmed,

The NAS drive is dedicated to just my desktop over a dedicated Gigabit Ethernet link, so I have the full bandwidth available. But I agree with you: long term, I am really looking forward to Server SAS and a SAN!

AhmedAl_Attar
Rhodochrosite | Level 12

In this case, I would recommend looking into customizing the following SAS settings when you start your local SAS session

  1. Memsize
  2. Realmemsize
  3. Sortsize
  4. BUFNO
  5. BUFSIZE
  6. UTILLOC

Ensure you have the following options set (via Autoexec file)

options FULLSTIMER MSGLEVEL=I;
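A sketch of what that tuning might look like in a desktop SAS shortcut or sasv9.cfg — the sizes below are illustrative starting points to experiment with against FULLSTIMER output, not recommendations for every machine:

```sas
/* Example startup options (command line or sasv9.cfg).
   All values are illustrative -- tune for your own hardware. */
-memsize 4G
-realmemsize 3G
-sortsize 1G
-bufsize 64K
-bufno 10
-utilloc "e:\sasutil"   /* put utility files on a fast local disk */
```

MSGLEVEL=I is what makes the log report which index (if any) a WHERE clause actually used, so you can verify the tuning is working.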

Check the following papers for additional and more in-depth information

http://support.sas.com/rnd/scalability/papers/practicalperf.pdf

http://support.sas.com/rnd/scalability/papers/solve_perf.pdf

http://support.sas.com/rnd/scalability/papers/

Kurt_Bremser
Super User

- NAS drive

There you have it. Your NAS is (at best) connected with a Gbit interface. That can give you a raw transfer of 100MB/sec. TCP/IP has overhead, on top of that you have the overhead of the SMB protocol, so don't expect much more than 50 MB/sec net throughput.

3 TB = 3072 GB = 3,145,728 MB. Divide that by 50, and you get almost a day for just one sequential read. Combine that with the need to write data out to the workspace, and you get a virtual breakdown.
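The arithmetic as a quick sanity check (the 50 MB/s net throughput is the estimate above, not a measured figure):

```sas
data _null_;
   size_mb   = 3 * 1024 * 1024;   /* 3 TB expressed in MB        */
   rate_mb_s = 50;                /* assumed net throughput MB/s */
   hours     = size_mb / rate_mb_s / 3600;
   put 'One sequential read takes about ' hours 8.1 ' hours';
run;
```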

To handle this kind of data, you need a machine with several internal buses, a FC adapter on each bus, and separate drive boxes connected to the FC's. Or a SAN installation on the FC's that spreads the load intelligently.

SAS is mostly dependent on I/O throughput when handling large data, so you need to make sure that every link of the I/O chain saturates its bandwidth.

Oh, and since it may be forgotten: try to store such big datasets with the compress=yes or compress=binary option. If you get a reasonable compression rate, you can trade CPU cycles for I/O.
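A sketch of that trade (library and data set names assumed for illustration):

```sas
/* Store compressed: CPU cycles are usually cheaper than the saved I/O.
   COMPRESS=BINARY tends to suit mostly-numeric data;
   COMPRESS=YES (RLE) suits character-heavy data. */
data big.sensor_compressed (compress=binary);
   set big.sensor_data;
run;
```

The log's "decreased size by x%" note tells you directly whether the trade-off paid off; if compression increases the size, SAS says so and you should store the table uncompressed.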

ChrisNZ
Tourmaline | Level 20

I have yet to come to a case where SPDE (not SPDS, SPDE) is faster than a standard disk array. Proc summary would be just as fast reading from a dataset in a volume spread on a raid 10 disk array as from SPDE tables spread on several volumes on the same disks. Additionally, the array makes it a lot easier to manage data as you have only one volume to work with, and datasets are easier to deal with at the OS level than SPDE tables.

I am not saying SPDE is never useful, but in this case I doubt it. When SPDE tables can be partitioned by say a class variable value instead of simply by size, they might become more useful, but in this case, advising the OP to investigate this will not help imho.

The OP does have a huge I/O problem though when one table scan takes over 58 hours. This has to be solved by better hardware (assuming common sense prevailed regarding field lengths and compression). A DAS is cheap and very quick, just what the doctor ordered for these symptoms.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi Chris,

OK, I agree with you about the disk array benefits. But I was not saying SPDE is faster than disk arrays. Based on FredGIII's messages, disk arrays are not an option, as this is a desktop SAS installation he is dealing with! Having said that, using SPDE for large data sets such as the one mentioned here, even stored on an optimized disk array, would surpass the performance of the SAS Base engine for the following reasons. The SPD Engine:

  • supports more than 32K columns in SAS 9 and later (the Base engine supports more than 32K columns only in SAS 9.1 and later).
  • is the only SAS engine that supports more than 2³¹ − 1 (approximately 2 billion) rows on 32-bit hosts.
  • supports the implicit sort for BY processing.
  • supports optimization of a WHERE expression with multiple indexes (the Base engine uses a single index at most).
  • supports optimization of WHERE expressions containing OR.
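The last two points matter for queries like the following sketch (library, table, and variable names are assumed), where SPDE can combine both indexes while the Base engine would pick at most one:

```sas
/* With separate indexes on serial and date, SPDE can use both to
   optimize this WHERE clause, including the OR. With MSGLEVEL=I,
   the log reports which indexes were actually used. */
proc summary data=spdelib.sensor_data nway;
   where serial = 'SN-00123' or date = '01JAN2014'd;
   class serial date;
   var reading;
   output out=daily_sum mean=;
run;
```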

Hope this clarifies the response for you and others,

Regards,

Ahmed

Kurt_Bremser
Super User

SPDE needs physically separate volumes for the "bins" to improve I/O. Running it on a single disk array will not improve anything, as the array already spreads the workload. Using SPDE with a quadcore and 4 arrays (preferably on 4 separate PCIe buses) will improve things. But that needs a real server and not a Windows toybox.


Discussion stats
  • 39 replies
  • 4724 views
  • 6 likes
  • 13 in conversation