topic Re: Advantage and disadvantage of hash table? in SAS Programming

Advantage and disadvantage of hash table?

gyambqt — Mon, 30 Mar 2015 06:27:00 GMT

Hello Experts:

I have a few questions about hash tables.

1. When a hash table is used to look up matching records, are the records in the hash table scanned sequentially from top to bottom, or does the lookup jump straight to the corresponding key value, like an index?

2. What are the advantages and disadvantages of a hash table compared with PROC SQL and a data step merge?

3.To be updated in the future...

Thanks

Re: Advantage and disadvantage of hash table?

Kurt_Bremser — Mon, 30 Mar 2015 06:33:32 GMT

1. You can work through a hash object sequentially and with keys.

2. The advantage of a hash object is that you can avoid lots of I/O on one of the input tables (if you don't do a straight merge); the disadvantage is that you will be limited by the available memory.
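
As a minimal sketch of the two access modes, the step below loads a small table into a hash object, fetches one key directly with find(), and then walks all items with a hash iterator. The table work.lookup and its variables id and name are made up for illustration.

```sas
/* Hypothetical lookup table, one row per key */
data work.lookup;
  input id name $;
  datalines;
1 Alpha
2 Beta
3 Gamma
;
run;

data _null_;
  /* Define id and name in the PDV at compile time; never executed */
  if 0 then set work.lookup;

  /* Load the whole table into memory once, kept in ascending key order */
  declare hash h(dataset: 'work.lookup', ordered: 'a');
  h.defineKey('id');
  h.defineData('id', 'name');
  h.defineDone();
  declare hiter it('h');

  /* Keyed access: go straight to a single key */
  id = 2;
  rc = h.find();
  if rc = 0 then put 'Keyed lookup: ' id= name=;

  /* Sequential access: walk through all items in key order */
  rc = it.first();
  do while (rc = 0);
    put 'Iterator: ' id= name=;
    rc = it.next();
  end;

  stop;  /* no SET statement executes, so end the step explicitly */
run;
```

The find() call does not scan the table; the iterator is only needed when every item has to be visited.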

Re: Advantage and disadvantage of hash table?

Ksharp — Mon, 30 Mar 2015 12:52:05 GMT

1. It will jump straight to the corresponding key value, just like an index. In a hash map (the Java term; the older name is hash table), the entries are spread over buckets, and the number of buckets affects the speed of querying: the more buckets, the faster. You can define it when constructing the hash object, like this: declare hash h(hashexp:20); 20 is the maximum value of hashexp, which gives 2^20 buckets.

2. A hash map is a querying tool that trades space (memory) for time. Therefore the disadvantage is the memory limitation, and it is hard to handle continuous (double) values as keys.

Xia Keshan

Re: Advantage and disadvantage of hash table?

gyambqt — Mon, 30 Mar 2015 22:35:53 GMT

Thanks a lot !!!!

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 03:59:24 GMT

Hi KurtBremser,

I am a bit confused about why a hash can improve the I/O process.

Assume you have two tables to merge, Table A and Table B, each containing 50k records, and the input buffer can only take 10k records at a time.

If you use a data step merge, then the 50k+50k records will be loaded into the input buffer in 10 passes ((50k+50k)/10k = 10).

50k+50k on hard disk---->input buffer----->PDV---->etc.

If you use a hash to hold table A, it takes 5 passes (50k/10k = 5) to load the records of table A into the input buffer and then into the hash table. It also takes 5 passes (50k/10k = 5) to load the records of table B into the input buffer. So the total is 5+5 = 10 passes. There is no difference...

50k records in table A on hard disk---->input buffer---->PDV----->HASH

50k records in table B on hard disk---->input buffer---->PDV---->etc.

Please explain.

Re: Advantage and disadvantage of hash table?

Kurt_Bremser — Wed, 01 Apr 2015 06:35:45 GMT

Imagine a situation where you have to match several records of dataset B to a record of dataset A, on a complex condition that can't be solved by simply sorting the datasets and merging them along the sort order.

Think of the fuzzy logic needed to match actual inputs of your users to the normalized names of, say, car types.

In that case you will have to read more or less randomly through dataset B for every record of dataset A.

If you loaded dataset B into a hash object at _n_ = 1, you have the complete I/O from dataset B only once, and sequentially at that.
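
A simplified sketch of that pattern follows. It uses a plain key match rather than a fuzzy condition, and the datasets work.a and work.b with variables id, a_value and b_value are assumptions made for illustration. Dataset B is read from disk once when the hash is loaded at _n_ = 1; every later lookup happens in memory.

```sas
/* Hypothetical sample data: one row per id in A, several rows per id in B */
data work.a;
  input id a_value $;
  datalines;
1 x
2 y
3 z
;
run;

data work.b;
  input id b_value $;
  datalines;
1 b11
1 b12
3 b31
;
run;

data work.matched;
  if _n_ = 1 then do;
    if 0 then set work.b;  /* define B's variables in the PDV at compile time */
    /* multidata: 'y' allows several items per key */
    declare hash hb(dataset: 'work.b', multidata: 'y');
    hb.defineKey('id');
    hb.defineData('b_value');
    hb.defineDone();
  end;

  set work.a;              /* A is read sequentially, one pass */

  /* Fetch the first match, then every further item with the same key */
  rc = hb.find();
  do while (rc = 0);
    output;
    rc = hb.find_next();
  end;
  drop rc;
run;
```

Neither input needs to be sorted, and records of A without a match in B (id 2 in the sample) are simply not written.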

Another use that I have encountered quite often is determining the actual working hours spent on a service call between the initial call and the resolution (Saturdays and Sundays don't count; Monday to Thursday 7am to 5pm, Friday 7am to 3pm; holidays don't count either).

Re: Advantage and disadvantage of hash table?

jakarman — Wed, 01 Apr 2015 07:51:50 GMT

Ah, another duplicate question. Not really a duplicate (2/), but it has the same root cause as https://communities.sas.com/thread/74705: optimizing the use of memory close to the processors (DIMMs) versus storage far away (DASD).

For 1/: you can choose to program either of the two approaches (indexed, like a B-tree, or sequential).

For 3/: you can add and delete data and keys while using hashes in a streamed data processing approach. That is not possible with the others.
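
A small sketch of adding and removing keys while the data streams through the step. The event data and the variable names ticket and action are hypothetical.

```sas
/* Hypothetical event stream: tickets are opened and closed over time */
data work.events;
  input ticket action $;
  datalines;
101 open
102 open
101 close
103 open
;
run;

data _null_;
  if _n_ = 1 then do;
    declare hash h();      /* starts empty, filled while streaming */
    h.defineKey('ticket');
    h.defineData('ticket');
    h.defineDone();
  end;

  set work.events end=last;

  /* Add a key when a ticket opens, remove it when the ticket closes */
  if action = 'open' then rc = h.add();
  else if action = 'close' then rc = h.remove();

  if last then do;
    n = h.num_items;
    put 'Tickets still open: ' n=;
    /* Write whatever is left in the hash to a dataset */
    rc = h.output(dataset: 'work.still_open');
  end;
run;
```

With the sample rows, tickets 102 and 103 remain in the hash at the end of the step.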

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 08:23:10 GMT

Hi KurtBremser,

I am trying to understand you with the following example. Please let me know if I am wrong.

Assume we have two SAS datasets, A and B (they are not sorted by the BY variables).

We want to match A and B, but every record of A can match multiple records in B, so it is a one-to-many match.

If I use a data step MERGE statement, then many records from A will be selected and loaded into a buffer (memory), and many records from B will be selected and loaded into a buffer (memory).

If the records from A in the buffer are NOT found among the buffered records from B, then the buffer for B will be emptied and another set of records will be read from B into the buffer to match against the records from A (in the buffer).

This step is repeated until all the matching records are found, so it can cause a lot of I/O (reading data from disk).

But a hash can solve this problem because it holds all of B, so B is read only once to match against the records from A (the number of I/O operations will equal the number of records in A).

Am I correct?

If I am wrong, please let me know why.

Maybe my understanding of buffers and I/O is wrong...

Re: Advantage and disadvantage of hash table?

jakarman — Wed, 01 Apr 2015 08:39:02 GMT

gyambqt, there is no buffering involved in the merge logic; only the current record pointer, as retrieved from the datasets, is used.
You can use the POINT= option (SAS(R) 9.4 Statements: Reference, Third Edition) to move that record pointer around in the dataset. It is a very unusual but possibly effective way to handle merging, for example when the merge condition is a time window that is present in the ordered data to be combined.

BY processing can be done with data that is not sorted (SAS(R) 9.4 Statements: Reference, Third Edition: the BY statement has a NOTSORTED option), although that reference restricts NOTSORTED from being used together with MERGE.
Sorted data is well-ordered data, and ordered data gives the best results with merging. But ordered does not have to mean sorted.
That sounds strange; the best example is a sorted dataset coming from another environment with a different encoding. That data is still nicely ordered but could cause issues because it is not sorted according to the current encoding.

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 08:57:04 GMT

Hi Jaap,

Do you mean that in a data step merge the records of both datasets are loaded directly into the PDV for matching? I thought a record would be loaded into a buffer first and then from the buffer into the PDV for matching.

Re: Advantage and disadvantage of hash table?

Kurt_Bremser — Wed, 01 Apr 2015 09:13:55 GMT

Forget your concern with the buffering, because you can NOT influence the caching that SAS does, apart from how much memory you grant it through MEMSIZE.

Anytime SAS accesses dataset(s), it tries to do some read-ahead etc. based on its own logic, but you cannot influence that.

What you can influence is the number of logical reads to a certain object; the hash table is a method to guarantee that a certain object is read only once from disk and then kept in memory, no matter how often and in what sequence you access it.

When you have a simple 1-to-1 or 1-to-n relationship based on a given key, sorting and merging is as simple as it can get. During the merge, SAS will go sequentially through the datasets and perform some read-ahead based on its own memory parameters; meanwhile, the OS will also read ahead into the persistent system memory according to its own educated guesses based on process behaviour. Since the OS knows about all processes currently running (your SAS process only knows about itself), it is best to leave most of the work to the OS by keeping the SAS MEMSIZE small in a multiuser environment.
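
For the simple keyed case, the sort-and-merge approach can be sketched as follows. The datasets and variable names are made up for illustration.

```sas
/* Hypothetical inputs: one row per id in A, possibly several per id in B */
data work.a;
  input id a_value $;
  datalines;
3 z
1 x
2 y
;
run;

data work.b;
  input id b_value $;
  datalines;
3 b31
1 b11
1 b12
;
run;

/* MERGE with BY needs both inputs in BY-variable order */
proc sort data=work.a; by id; run;
proc sort data=work.b; by id; run;

/* Both datasets are then read once, sequentially */
data work.merged;
  merge work.a(in=in_a) work.b(in=in_b);
  by id;
  if in_a and in_b;  /* keep only ids present in both inputs */
run;
```

The cost here is the two sorts; the merge itself is a single sequential pass over each input.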

When you need to deal with a complex and mostly unpredictable relationship (e.g. reading a start and end date and matching that against parameters for a given time range, like holidays, even in different regions!), then a hash table is a good method to prevent multiple random reads of the lookup table (or even multiple passes through the whole table), which would completely defeat all attempts by the OS to make intelligent decisions about what to cache for optimum system performance.

Randomized reads are also the primary cause for wait states in the storage subsystem, and bad overall performance.

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 09:38:45 GMT

Thanks for your reply.

I was confused by the following link indeed:

http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m81/m81_2.htm

It said

SAS copies the data from the input data set to a buffer in memory

one observation at a time is loaded into the program data vector

each observation is written to an output buffer when processing is complete

the contents of the output buffer are written to the disk when the buffer is full

Re: Advantage and disadvantage of hash table?

jakarman — Wed, 01 Apr 2015 09:45:21 GMT

Agree with Kurt. His remark about random access to the OS file system is an important one. The effect diminishes with SSDs, but with rotating hard disks those mechanical delays are the reason why an SQL approach can be slower than sorting and merging.
I have seen the turnaround point at about 10-20%: when more than that share of the data is accessed, sequential processing performs faster. Remember that RDBMSs with SQL were designed for OLTP and not for big data processing. Only with massively parallel processing can SQL gain in speed/performance. That brings you to Parallel computing - Wikipedia, the free encyclopedia (Hadoop, Teradata, grid).

The PDV is only a SAS data step concept; it comes into play AFTER all data has come in.
The buffering of those records is not easily controlled, although you can do some things, as WHERE and KEEP processing is done on those records independently (dataset options). SAS(R) 9.4 Data Set Options: Reference, Second Edition

Just look at those and realize they are outside the PDV scope. http://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/TASS/Mehatab-DataStepPDV.pdf is a nice presentation that shows the buffer as the interaction with your OS, while all your data processing in the data step uses the PDV. The WHERE statement is confusing, as it is moved outside the PDV and executes before it. This is different from IF and SELECT/WHEN, as they process all data seen by the PDV. For performance it is best that no data that is not needed comes into the PDV. When optimizing outgoing data, you can drop/keep variables before they are written out.
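
A short sketch of that difference, using the SASHELP.CLASS sample table that ships with SAS. The output dataset names are made up.

```sas
/* WHERE= and KEEP= as dataset options: rows and columns are filtered
   before anything is moved into the PDV */
data work.where_subset;
  set sashelp.class(keep=name age where=(age > 13));
run;

/* Subsetting IF: every row and every column is first loaded into the PDV,
   and only then is the row tested; KEEP here only affects the output */
data work.if_subset;
  set sashelp.class;
  if age > 13;
  keep name age;
run;
```

Both steps write the same rows and columns; the difference is how much data has to pass through the PDV to get there.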

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 10:00:31 GMT

I have read the PowerPoint about the PDV. The buffer still exists, but according to Kurt's remark there is no good way to control how it reads the data records from the hard disk.

I have read other people's remarks that, for 1-to-1 matching, there is no difference between a data step merge and a hash table match with respect to I/O.

Since datasets are accessed sequentially in a data step merge, shouldn't it cause more I/O than a hash table match?

Re: Advantage and disadvantage of hash table?

jakarman — Wed, 01 Apr 2015 10:12:46 GMT

The data step MERGE with BY does NOT access the data randomly; it reads it sequentially. Sequential access is very predictable for the OS (buffering).
The overhead is caused by getting the data ordered first (the sorting). The intermediate approach is using indexes: the index is sorted, but the data that is retrieved isn't. The worst approach would be a sequential search from the start for every record.

The most efficient one is the "balance line"; that is why it is important and can be found in an RDBMS (explain paths etc.).

Typical indexed access will cause a random I/O pattern (normal with SQL). This is the same as POINT= and indexed BY usage in a SAS data step.

A hash avoids all further I/O on the table once it is loaded, but you can still do a "balance line", an index-style lookup, or a restart from the first obs with it. That is your choice when using hashes.

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 10:16:52 GMT

THx jaap.

Re: Advantage and disadvantage of hash table?

Patrick — Wed, 01 Apr 2015 11:02:32 GMT

You say that for a hash "the more buckets the faster" but if I read the SAS documentation SAS(R) 9.4 Component Objects: Reference, Second Edition then it states that for optimal performance the number of buckets must be in some relation to the number of items in a bucket.

I have never found a fully satisfying explanation or guideline what this relationship should be or how to determine "automatically" the right exponent based on the number of records in the source table. SAS DI Studio uses some log() function to create dynamically a hash exponent but interestingly the result of this function returns a hashexp of "8" for 1 million records which is in contradiction to the documentation (link above) which states that a hashexp of 9 or 10 would result in the best performance.

Any additional explanation or pointer to papers would be very welcome (I've read a few of Paul Dorfman's papers but couldn't really derive "my answer" out of these papers).

Thanks

Patrick

Re: Advantage and disadvantage of hash table?

Ksharp — Wed, 01 Apr 2015 12:05:47 GMT

Hi Patrick,

Almost all of my knowledge about hash tables comes from Paul Dorfman's papers. Of course I also know how the hash map in Java works; I learned that earlier. Going back to your question: you are right, the right exponent depends on the number of records in the source table.

If you have only a couple of hundred obs and define lots of buckets, like 2^20, that will not bring you any speed advantage, whereas it will cost you lots of memory. Therefore it is good that SAS DI Studio uses some log() function to create a hash exponent dynamically. I actually have no idea how to decide the number of buckets for a hash table. When I have a couple of million obs, I usually use 20. Maybe that is too wasteful. I don't know.

Xia Keshan

Re: Advantage and disadvantage of hash table?

jakarman — Wed, 01 Apr 2015 12:06:24 GMT

Hi, I do not think SAS is using a unique hash solution; the generic approach is described at Hash table - Wikipedia, the free encyclopedia.
The approach of using buckets/hashes is the same as optimizing the spread of data in a DBMS environment: the same logical question, just a different area where it is applied. I cannot let it go. SQL hash distribution of data is an application of the bucket/hash algorithm. The hash is used in a DBMS file structure to place records in some preferred page areas (like buckets).

It avoids data movement within memory, which can also get costly. Imagine that removing or adding 1 record of data caused all the others to be moved around in memory. That would possibly need a lot of CPU instructions. Avoiding those will reduce the time.

The details of the implemented hashing technique are less relevant. They can be changed and optimized by SAS (if you use that one) over time.

Re: Advantage and disadvantage of hash table?

gyambqt — Wed, 01 Apr 2015 22:54:36 GMT

Hi Kurt,

http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m81/m81_2.htm

http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m81/m81_3.htm

If you read the above links (very short), they say:

SAS copies the data from the input data set to a buffer in memory
one observation at a time is loaded into the program data vector
A buffer can be treated as a container in memory that is big enough for only one page of data.
A page is the unit of data transfer between the storage device and memory
The amount of data that can be transferred to one buffer in a single I/O operation is referred to as page size. Page size is analogous to buffer size for SAS data sets.

I think the buffer here means the SAS buffer rather than the OS buffer (can we ignore the OS here?).

The size of the buffer can be adjusted in SAS: the bigger it is, the more data it can take each time, and consequently the less I/O.
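
As a sketch of where those settings live: the BUFSIZE= dataset option sets the page size when a dataset is created, and BUFNO= sets how many buffers are used while it is processed. The dataset name and the values below are arbitrary, and whether they help depends on the system.

```sas
/* Create a test dataset with an explicit page size of 64K */
data work.big(bufsize=64k);
  do i = 1 to 50000;
    x = ranuni(1);
    output;
  end;
run;

/* Read it back with 10 buffers allocated for the input dataset */
data _null_;
  set work.big(bufno=10);
run;

/* PROC CONTENTS reports the page size and the number of pages */
proc contents data=work.big;
run;
```

Comparing the PROC CONTENTS output for different BUFSIZE= values shows how many pages the same 50,000 observations occupy.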

My question is: if the buffer size is small, for example it can only take 10k records at a time, but you have 50k records to load into a hash table, it still takes 5 I/O operations to transfer the 50k records from hard disk to the hash (memory).

I think that is a little different from your remark that only one I/O operation is required to load the 50k records from hard disk into the hash.

My understanding of I/O is that it happens when data records are read from hard disk by SAS into the SAS buffer. If the SAS buffer size is small, it will require more I/O operations to finish reading the complete dataset.

Is there a problem with my understanding?