AhmedAl_Attar
Rhodochrosite | Level 12

FG,

Based on your reply, it sounds like network bandwidth is your bottleneck! Because you are using a NAS to host your data set, every time you read it you are putting your SAS process at the mercy of the network, hoping there is adequate bandwidth left to transfer the data down to your desktop so SAS can read it!

Regardless of whether you add indexes, I'm not sure how much difference it will make! Unless someone from your company's network infrastructure team increases (if that's even possible) the size of the network pipe between your desktop and the storage disk, you will not see any significant performance increase!

In my humble opinion, your best short-term solution right now would be:

1. Use multiple directly attached disks (the more spindles, the better)

2. Utilize SPDE and spread the Data Paths across the multiple disks (avoids I/O contentions)

3. Use proper -memsize -sortsize SAS startup options settings
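For item 2, a minimal SPDE sketch might look like the following. The drive letters and paths are placeholders for illustration only; point them at separate physical spindles on your own machine:

```sas
/* SPDE library with data spread across three directly attached disks.
   Paths below are hypothetical -- use one per physical spindle. */
libname bigdata spde 'd:\sasmeta'                            /* metadata   */
        datapath=('e:\sasdata1' 'f:\sasdata2' 'g:\sasdata3') /* partitions */
        indexpath=('h:\sasidx');                             /* index files */

/* Copy an existing Base-engine table into the SPDE library */
proc copy in=work out=bigdata;
   select bigtable;
run;
```

SPDE round-robins partitions across the DATAPATH locations, which is what spreads the I/O and avoids the contention mentioned in item 2.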

long-term:

Move to Server SAS Install and SAN 🙂

Good luck,

Ahmed

FredGIII
Quartz | Level 8

Ahmed, Chris, et al.:

Thanks for all your feedback. I feel a bit foolish. During my research it appeared to me that hash tables would be best, and I spent a lot of time trying to figure out that route. What I had not tried was simply creating an index. I finally did that yesterday. Creating the index took about 12 hours (long, but at least not 3 days!). After the index was created, I was amazed at how fast subsetting and summarizing went. A trial of PROC SUMMARY for one serial number by day took only about 4 minutes! So now I have a macro to loop through the serial numbers, and I have turned it loose 🙂. Lesson learned!
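A rough sketch of that kind of index-driven loop (table, library, and variable names here are illustrative, not the actual code): with an index on the serial-number variable, each WHERE= subset is satisfied by the index rather than a full table scan.

```sas
/* Illustrative only: summarize each serial number by day,
   letting the index on serial satisfy the WHERE clause. */
%macro summarize_all;
   /* Build the list of serial numbers to loop over */
   proc sql noprint;
      select distinct serial into :serials separated by ' '
      from big.sensor_data;
   quit;
   %local i sn;
   %do i = 1 %to %sysfunc(countw(&serials));
      %let sn = %scan(&serials, &i);
      proc summary data=big.sensor_data(where=(serial="&sn")) nway;
         class date;
         var reading;
         output out=sum_&sn (drop=_type_ _freq_) mean=;
      run;
   %end;
%mend summarize_all;
```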

stlimpbizkit
Calcite | Level 5

Thanks Fred for sharing. I am interested in this topic as it applies to what I am working on as well.

Do you mind clarifying the index method that helped you shorten the run time so significantly: is it simply a unique identifier for the purpose of referring to certain rows/records? What was the old method (e.g., multiple selection criteria)?

Just trying to understand what you did differently pre vs. post. Many thanks.

ballardw
Super User

When you create a new data set you can specify one or more variables to index on.

With a data step, if I create a data set and want to index on an SSN, it would look like:

data new(index=(ssn));

Or a composite index using more than one variable:

data new (index=(cityst = (city state)));

Refer to a composite index using the index name: cityst

Or you can add an index to an existing data set using Proc Datasets and the INDEX CREATE statement as part of a MODIFY RUN group.
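A minimal example of that MODIFY run group (library, data set, and variable names assumed for illustration):

```sas
/* Add a simple index on ssn and a composite index cityst
   to an existing data set, without rebuilding it. */
proc datasets library=mylib nolist;
   modify new;
      index create ssn;
      index create cityst = (city state);
   run;
quit;
```

This avoids rewriting the table, which matters when a rebuild of a multi-terabyte data set would itself take days.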

FredGIII
Quartz | Level 8

Oh by the way Ahmed,

The NAS drive is dedicated to just my desktop over a dedicated Gigabit Ethernet link, so I have the full bandwidth available. But I agree with you: long term, I am really looking forward to Server SAS and a SAN!

AhmedAl_Attar
Rhodochrosite | Level 12

In this case, I would recommend looking into customizing the following SAS settings when you start your local SAS session

  1. Memsize
  2. Realmemsize
  3. Sortsize
  4. BUFNO
  5. BUFSIZE
  6. UTILLOC

Ensure you have the following options set (via Autoexec file)

options FULLSTIMER MSGLEVEL=I;
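A sketch of what that tuning might look like in a desktop SAS shortcut or sasv9.cfg — the sizes below are illustrative starting points to experiment with against FULLSTIMER output, not recommendations for every machine:

```sas
/* Example startup options (command line or sasv9.cfg).
   All values are illustrative -- tune for your own hardware. */
-memsize 4G
-realmemsize 3G
-sortsize 1G
-bufsize 64K
-bufno 10
-utilloc "e:\sasutil"   /* put utility files on a fast local disk */
```

MSGLEVEL=I is what makes the log report which index (if any) a WHERE clause actually used, so you can verify the tuning is working.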

Check the following papers for additional and more in-depth information

http://support.sas.com/rnd/scalability/papers/practicalperf.pdf

http://support.sas.com/rnd/scalability/papers/solve_perf.pdf

http://support.sas.com/rnd/scalability/papers/

Kurt_Bremser
Super User

- NAS drive

There you have it. Your NAS is (at best) connected with a Gbit interface. That can give you a raw transfer of 100MB/sec. TCP/IP has overhead, on top of that you have the overhead of the SMB protocol, so don't expect much more than 50 MB/sec net throughput.

3 TB = 3072 GB = 3,145,728 MB. Divide that by 50, and you get almost a day for just one sequential read. Combine that with the need to write data out to the workspace, and you get a virtual breakdown.
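The arithmetic as a quick sanity check (the 50 MB/s net throughput is the estimate above, not a measured figure):

```sas
data _null_;
   size_mb   = 3 * 1024 * 1024;   /* 3 TB expressed in MB        */
   rate_mb_s = 50;                /* assumed net throughput MB/s */
   hours     = size_mb / rate_mb_s / 3600;
   put 'One sequential read takes about ' hours 8.1 ' hours';
run;
```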

To handle this kind of data, you need a machine with several internal buses, a FC adapter on each bus, and separate drive boxes connected to the FC's. Or a SAN installation on the FC's that spreads the load intelligently.

SAS is mostly dependent on I/O throughput when handling large data, so you need to make sure that every link of the I/O chain saturates its bandwidth.

Oh, and since it may be forgotten: try to store such big datasets with the compress=yes or compress=binary option. If you get a reasonable compression rate, you can trade CPU cycles for I/O.
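A sketch of that trade (library and data set names assumed for illustration):

```sas
/* Store compressed: CPU cycles are usually cheaper than the saved I/O.
   COMPRESS=BINARY tends to suit mostly-numeric data;
   COMPRESS=YES (RLE) suits character-heavy data. */
data big.sensor_compressed (compress=binary);
   set big.sensor_data;
run;
```

The log's "decreased size by x%" note tells you directly whether the trade-off paid off; if compression increases the size, SAS says so and you should store the table uncompressed.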

ChrisNZ
Tourmaline | Level 20

I have yet to come to a case where SPDE (not SPDS, SPDE) is faster than a standard disk array. Proc summary would be just as fast reading from a dataset in a volume spread on a raid 10 disk array as from SPDE tables spread on several volumes on the same disks. Additionally, the array makes it a lot easier to manage data as you have only one volume to work with, and datasets are easier to deal with at the OS level than SPDE tables.

I am not saying SPDE is never useful, but in this case I doubt it. When SPDE tables can be partitioned by say a class variable value instead of simply by size, they might become more useful, but in this case, advising the OP to investigate this will not help imho.

The OP does have a huge I/O problem though when one table scan takes over 58 hours. This has to be solved by better hardware (assuming common sense prevailed regarding field lengths and compression). A DAS is cheap and very quick, just what the doctor ordered for these symptoms.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi Chris,

OK, I agree with you about the disk array benefits. But I was not saying SPDE is faster than disk arrays. Based on FredGIII's messages, disk arrays are not an option, as this is a desktop SAS installation he is dealing with! Having said that, using SPDE for large data sets such as the one mentioned here, even stored on an optimized disk array, would surpass the performance of the SAS Base engine for the following reasons. The SPD Engine:

  • supports more than 32K columns in SAS 9 and later (the Base engine supports more than 32K columns only in SAS 9.1 and later).
  • is the only SAS engine that supports more than 2³¹ − 1 (approximately 2 billion) rows on 32-bit hosts.
  • supports the implicit sort for BY processing.
  • supports optimization of a WHERE expression with multiple indexes (the Base engine uses a single index at most).
  • supports optimization of WHERE expressions containing OR.
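The last two points matter for queries like the following sketch (library, table, and variable names are assumed), where SPDE can combine both indexes while the Base engine would pick at most one:

```sas
/* With separate indexes on serial and date, SPDE can use both to
   optimize this WHERE clause, including the OR. With MSGLEVEL=I,
   the log reports which indexes were actually used. */
proc summary data=spdelib.sensor_data nway;
   where serial = 'SN-00123' or date = '01JAN2014'd;
   class serial date;
   var reading;
   output out=daily_sum mean=;
run;
```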

Hope this clarifies the response for you and others,

Regards,

Ahmed

Kurt_Bremser
Super User

SPDE needs physically separate volumes for the "bins" to improve I/O. Running it on a single disk array will not improve anything, as the array already spreads the workload. Using SPDE with a quadcore and 4 arrays (preferably on 4 separate PCIe buses) will improve things. But that needs a real server and not a Windows toybox.


Discussion stats
  • 39 replies
  • 4724 views
  • 6 likes
  • 13 in conversation