03-30-2015 02:27 AM
I have a few questions about hash tables.
1. When using a hash table to look up matching records, are the records in the hash table scanned sequentially from top to bottom, or does the lookup jump straight to the corresponding key value, like an index?
2. What are the advantages and disadvantages of a hash table compared to PROC SQL and a DATA step merge?
3. To be updated in the future...
03-30-2015 02:33 AM
1. You can work through a hash object sequentially and with keys.
2. The advantage of a hash object is that you can avoid lots of I/O on one of the input tables (if you don't do a straight merge); the disadvantage is that you will be limited by the available memory.
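For question 1, both access patterns are available on the same hash object: a keyed find() and a sequential hash iterator. A minimal sketch, assuming a dataset B with variables KEY and VAL (those names are illustrative, not from the thread):

```
data _null_;
  if 0 then set b;                     /* make KEY/VAL known to the compiler */
  declare hash h(dataset:'b');         /* load B into memory once            */
  h.defineKey('key');
  h.defineData('val');
  h.defineDone();

  /* keyed access: jumps straight to the key, like an index */
  key = 42;
  rc = h.find();

  /* sequential access: walk the items with a hash iterator */
  declare hiter hi('h');
  rc = hi.first();
  do while (rc = 0);
    put key= val=;
    rc = hi.next();
  end;
  stop;
run;
```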
03-31-2015 11:59 PM
I am a bit confused about why a hash can improve the I/O process.
Assume you have two tables you want to merge, each containing 50k records: table A and table B, and the input buffer can only hold 10k records at a time.
If you use a DATA step merge, then the 50k+50k records will be loaded into the input buffer 10 times ((50k+50k)/10k = 10 times):
50k+50k records on hard disk ----> input buffer ----> PDV ----> etc.
If you use a hash to hold table A, then it takes 5 loads (50k/10k = 5 times) for table A's records to pass through the input buffer into the hash table, and another 5 loads (50k/10k = 5 times) for table B's records. So the total is 5+5 = 10 times; there is no difference...
50k records in table A on hard disk ----> input buffer ----> PDV ----> hash
50k records in table B on hard disk ----> input buffer ----> PDV ----> etc.
04-01-2015 02:35 AM
Imagine a situation where you have to match several records of dataset B to a record of dataset A, on a complex condition that can't be solved by simply sorting the datasets and merging them along the sort order.
Think of the fuzzy logic needed to match actual inputs of your users to the normalized names of, say, car types.
In that case you will have to read more or less randomly through dataset B for every record of dataset A.
If you loaded dataset B into a hash object at _n_ = 1, you have the complete I/O from dataset B only once, and sequentially at that.
Another use case that I encountered quite often is when you have to determine the actual work hours spent on a service call between the initial call and the problem being solved (Saturday and Sunday don't count, Monday to Thursday 7am to 5pm, Friday 7am to 3pm, holidays also don't count).
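The pattern of loading dataset B into a hash object at _n_ = 1 might look like this minimal sketch (the dataset names A and B and variables KEY and VAL are assumptions for illustration):

```
data matched;
  if _n_ = 1 then do;
    if 0 then set b;                 /* define B's variables in the PDV      */
    declare hash h(dataset:'b');     /* one sequential pass over B, on disk  */
    h.defineKey('key');
    h.defineData('val');
    h.defineDone();
  end;
  set a;                             /* one sequential pass over A           */
  if h.find() = 0 then output;       /* the "random" lookups happen in memory */
run;
```

The point is that the random access needed for the matching happens against the in-memory hash, while both datasets are read from disk exactly once, sequentially.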
04-01-2015 04:23 AM
I am trying to understand you with the following example; please let me know if I am wrong.
Assume we have two SAS datasets, A and B (they are not sorted by the BY variables).
We want to match A and B, but every record of A can match multiple records in B, so it is a one-to-many match.
If I use a DATA step MERGE statement, then records from A will be selected and loaded into a buffer (memory), and records from B will be selected and loaded into a buffer (memory).
If the records in A's buffer are NOT found among the records in B's buffer, then B's buffer will be emptied and another set of records will be selected from B into the buffer to match against A's records (in the buffer).
This step is repeated until all the matching records are found, so it can cause a lot of I/O (reading data from disk).
But a hash can solve this problem because it holds B in memory, so B is read from disk only once while matching against the records from A (the number of lookups will equal the number of records in A).
Am I correct?
If I am wrong please let me know why.
Maybe my understanding of buffer and I/O is wrong...
03-30-2015 08:52 AM
1. It will jump straight to the corresponding key value, just like an index. In a Java HashMap (the old Hashtable), the slots are called buckets, and the number of buckets determines the querying speed: more buckets, faster lookups. You can set it when constructing the hash object, like: declare hash h(hashexp:20); 20 is the maximum value, giving 2^20 buckets.
2. A hash map is a querying tool that trades space (memory) for time. The disadvantages are therefore the memory limitation, and that continuous values (doubles) are hard to handle as keys.
04-01-2015 07:02 AM
You say that for a hash "the more buckets the faster", but if I read the SAS documentation, SAS(R) 9.4 Component Objects: Reference, Second Edition, it states that for optimal performance the number of buckets must be in some relation to the number of items per bucket.
I have never found a fully satisfying explanation or guideline for what this relationship should be, or how to determine the right exponent "automatically" based on the number of records in the source table. SAS DI Studio uses some log() function to dynamically create a hash exponent, but interestingly this function returns a hashexp of 8 for 1 million records, which contradicts the documentation (link above), which states that a hashexp of 9 or 10 would give the best performance.
Any additional explanation or pointer to papers would be very welcome (I've read a few of Paul Dorfman's papers but couldn't really derive "my answer" out of these papers).
04-01-2015 08:05 AM
Almost everything I know about hash tables comes from Paul Dorfman's papers. Of course, I also know Java's HashMap, which I learned earlier. Back to your question: you are right that the right exponent depends on the number of records in the source table.
If you have only a couple of hundred observations and define lots of buckets, like 2^20, that will not bring you any speed advantage, whereas it will cost you lots of memory. Therefore it is good that SAS DI Studio uses some log() function to dynamically create a hash exponent. I actually have no rule for deciding the number of buckets; when I have a couple of million observations, I tend to use 20. Maybe that is too wasteful, I don't know.
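For what it's worth, a hash exponent can be computed at run time from the record count. The sizing rule below (roughly one item per bucket, capped at the documented maximum of 20) is my own assumption, not the DI Studio formula:

```
data _null_;
  if 0 then set b nobs=n;                  /* row count of B, no records read  */
  hx = min(20, max(4, ceil(log2(n))));     /* ~1 item per bucket, capped at 20 */
  call symputx('hashexp', hx);
  stop;
run;

/* later, when declaring the hash object: */
/* declare hash h(dataset:'b', hashexp:&hashexp.); */
```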
04-01-2015 03:51 AM
Ah, another duplicate question. Not really a duplicate (see 2/), but it has the same root cause: https://communities.sas.com/thread/74705, namely optimizing between near memory (close to the processors) and far memory (DASD).
For 1/: you can choose either approach in your program (indexed, like a B-tree, or sequential).
For 3/: you can add/delete data/keys while using the hash in a streamed data-processing approach. That is not possible with the others.
04-01-2015 04:39 AM
gyambqt, there is no buffering involved in the merge logic; only the current record pointer, as retrieved from the datasets, is used.
You can use point= (SAS(R) 9.4 Statements: Reference, Third Edition) to move that record pointer around in a dataset. A very weird but possibly effective way to handle data merging, for example when the task is merging within a time window that is present in the ordered data to be combined.
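The point= idea might be sketched like this; the dataset names A and B and the variables a_start, a_end, and b_time are assumptions for illustration:

```
data in_window;
  set a;                              /* each A record defines a time window  */
  do p = 1 to nb;                     /* move the record pointer through B    */
    set b point=p nobs=nb;            /* direct (random) access by row number */
    if a_start <= b_time <= a_end then output;
  end;
run;
```

In practice you would restrict the loop to the rows near the window rather than scanning all of B for every A record; otherwise this costs more I/O than a hash lookup.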
Merging can be done with data that is not sorted. SAS(R) 9.4 Statements: Reference, Third Edition (the BY statement has a NOTSORTED option).
Sorted data is well-ordered data, and ordered data gives the best results when merging. But ordered data does not need to be sorted.
That sounds strange; the best example is a sorted dataset coming from another environment with a different encoding. That data is still nicely ordered, but it could have issues because it does not count as sorted under the current encoding.
04-01-2015 04:57 AM
Do you mean that in the data merge the records for both datasets are loaded into the PDV directly for matching? I thought each record would be loaded into a buffer first and then from the buffer into the PDV for matching.