gyambqt
Obsidian | Level 7

Hello Experts:

I have a few questions about hash tables.

1. When using a hash table to look up matching records, are the records in the hash table scanned sequentially from top to bottom, or does the lookup jump straight to the corresponding key value, like an index?

2. What are the advantages and disadvantages of a hash table compared to PROC SQL and a DATA step merge?

3.To be updated in the future...

Thanks

25 REPLIES
Kurt_Bremser
Super User

1. You can work through a hash object both sequentially and by key.

2. The advantage of a hash object is that you can avoid lots of I/O on one of the input tables (if you don't do a straight merge); the disadvantage is that you are limited by the available memory.
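Both access patterns can be sketched in one DATA step. Here work.lookup and its variables id and descr are hypothetical names chosen for illustration:

```sas
data _null_;
   if 0 then set work.lookup;          /* defines id and descr in the PDV */
   declare hash h(dataset: "work.lookup", ordered: "a");
   h.defineKey("id");
   h.defineData("id", "descr");
   h.defineDone();
   declare hiter it("h");

   /* sequential pass with a hash iterator */
   rc = it.first();
   do while (rc = 0);
      put id= descr=;
      rc = it.next();
   end;

   /* keyed lookup: find() hashes the key and jumps to its bucket */
   id = 42;
   if h.find() = 0 then put "found: " descr=;
   stop;
run;
```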

gyambqt
Obsidian | Level 7


Hi KurtBremser,

I am a bit confused about why a hash can improve the I/O process.

Assume you have two tables to merge, Table A and Table B, each containing 50k records, and the input buffer can only take 10k records at a time.

If you use a DATA step merge, then the 50k+50k records will be loaded into the input buffer 10 times ((50k+50k)/10k = 10 times):

50k+50k on hard disk---->Input buffer----->PDV---->etc.

If you use a hash to hold Table A, then it takes 5 loads (50k/10k = 5) for the records of Table A to go through the input buffer into the hash table, and another 5 loads (50k/10k = 5) for the records of Table B to go through the input buffer. So the total is 5+5 = 10 loads; there is no difference....

50k records in table A on hard disk---->input buffer---->PDV----->HASH

50k records in table B on hard disk-----> input buffer----->PDV--->etc.

Please explain.

Kurt_Bremser
Super User

Imagine a situation where you have to match several records of dataset B to a record of dataset A, on a complex condition that can't be solved by simply sorting the datasets and merging them along the sort order.

Think of the fuzzy logic needed to match actual inputs of your users to the normalized names of, say, car types.

In that case you will have to read more or less randomly through dataset B for every record of dataset A.

If you loaded dataset B into a hash object at _n_ = 1, you have the complete I/O from dataset B only once, and sequentially at that.
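In code, that pattern looks roughly like this (dataset and variable names are made up for illustration):

```sas
data matched;
   if 0 then set work.b;               /* defines B's variables in the PDV */
   /* load dataset B into memory once, on the first iteration */
   if _n_ = 1 then do;
      declare hash hb(dataset: "work.b");
      hb.defineKey("key");
      hb.defineData("b_value");
      hb.defineDone();
   end;
   set work.a;                          /* A is read sequentially */
   /* each lookup is an in-memory probe, no further disk access to B */
   if hb.find() = 0 then output;
run;
```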

Another use that I encountered quite often is when you have to determine the actual work hours spent on a service call between initial call and problem solved (Saturday and Sunday don't count, Monday to Thursday 7am to 5pm, Friday 7am to 3pm, holidays also don't count).

gyambqt
Obsidian | Level 7


Hi KurtBremser,

I am trying to understand you with the following example; please let me know if I am wrong.

Assume we have two SAS datasets, A and B (they are not sorted on the BY variables).

We want to match A and B, but every record of A can match multiple records in B, so it is a one-to-many match.

If I use a DATA step MERGE statement, then records from A will be selected and loaded into a buffer (memory), and records from B will be selected and loaded into a buffer (memory).

If the records in the buffer from A are NOT found among the buffered records from B, then the buffer for B will be emptied and another set of records will be loaded from B into the buffer to match against the records from A (in the buffer).

This step is repeated until all matching records are found, so it can cause a lot of I/O (reading data from disk).

But a hash can solve this problem because it holds B in memory, so B is read from disk only once, and the remaining I/O is just the sequential read of A.

Am I correct?

If I am wrong please let me know why.

Maybe my understanding of buffer and I/O is wrong...

Ksharp
Super User

1. It will jump straight to the corresponding key value, just like an index. In a Java HashMap (the old catchphrase is "hash table"), the structure that decides query speed is called a bucket: more buckets, faster lookups. You can set it when constructing the hash object, like: declare hash h(hashexp: 20); 20 is the maximum value, giving 2**20 buckets.

2. A hash map is a querying tool that trades space (memory) for time. The disadvantages are therefore the memory limitation, and that continuous (floating-point) values are hard to handle as keys.
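The floating-point issue can be shown with a small sketch: two expressions that are mathematically equal can differ in their binary representation, so find() on the raw value misses; rounding both sides to a fixed precision is one workaround.

```sas
data _null_;
   declare hash h(hashexp: 20);        /* 2**20 buckets, the maximum */
   h.defineKey("k");
   h.defineDone();

   k = 0.1 + 0.2;                      /* not bit-identical to 0.3 */
   h.add();
   k = 0.3;
   rc1 = h.find();                     /* nonzero: key not found */

   k = round(0.1 + 0.2, 1e-9);         /* round before storing ... */
   h.add();
   k = round(0.3, 1e-9);               /* ... and before looking up */
   rc2 = h.find();                     /* 0: key found */
   put rc1= rc2=;
   stop;
run;
```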

Xia Keshan

gyambqt
Obsidian | Level 7

Thanks a lot !!!!

Patrick
Opal | Level 21

Hi

You say that for a hash "the more buckets the faster", but if I read the SAS documentation, SAS(R) 9.4 Component Objects: Reference, Second Edition, it states that for optimal performance the number of buckets must bear some relation to the number of items per bucket.

I have never found a fully satisfying explanation or guideline for what this relationship should be, or how to determine the right exponent "automatically" from the number of records in the source table. SAS DI Studio uses some log() function to derive a hash exponent dynamically, but interestingly this function returns a hashexp of 8 for 1 million records, which contradicts the documentation (link above), which states that a hashexp of 9 or 10 would give the best performance.

Any additional explanation or pointer to papers would be very welcome (I've read a few of Paul Dorfman's papers but couldn't really derive "my answer" out of these papers).
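For what it's worth, one hypothetical heuristic that reproduces the documentation's 9-10 figure for one million items is to take half of log2(n), capped at the maximum of 20. This is only a sketch, not a documented SAS rule:

```sas
data _null_;
   nobs = 1000000;                          /* items to be loaded */
   /* hypothetical rule of thumb, not a documented SAS formula */
   hexp = min(20, ceil(log2(nobs) / 2));
   put hexp=;                               /* 10 for one million items */
   call symputx("hexp", hexp);              /* usable later as hashexp: &hexp */
run;
```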

Thanks

Patrick

Ksharp
Super User

Hi Patrick,

Almost all the knowledge I have about hash tables comes from Paul Dorfman's papers. Of course I also know Java's HashMap, which I learned before. Back to your question: you are right that the right exponent depends on the number of records in the source table.

If you have only a couple of hundred observations and define lots of buckets, like 2**20, that would not give you any speed advantage, but it would cost you a lot of memory. So it is good that SAS DI Studio uses some log() function to derive the hash exponent dynamically. I actually have no fixed rule for choosing the number of buckets; when I have a couple of million observations, I tend to use 20. Maybe that is wasteful, I don't know.

Xia Keshan

Patrick
Opal | Level 21

Thanks

If you ever "stumble" over a good explanation of how to derive "the right" exponent then please let me know.

Ksharp
Super User

Hi Patrick,

That doesn't matter. I always use the default value of 8, or 20 when I have a big table. Smiley Wink

gyambqt
Obsidian | Level 7

Hello Xia,

I have tried to add you on WeChat.

Is this your WeChat ID: xia-keshan?

jakarman
Barite | Level 11

Ah, another question that is not really a duplicate (for 2/) but has the same root cause as https://communities.sas.com/thread/74705 : optimizing between memory close to the processors and memory far away (DASD).

For 1/, which approach is used: you can choose to program either of the two (indexed, like a B-tree, or sequential).

For 3/: you can add/delete data and keys while using the hash in a streamed data-processing approach. That is not possible with the other techniques.

---->-- ja karman --<-----
jakarman
Barite | Level 11

gyambqt, there is no buffering involved in the merge logic; only the current record pointer, as retrieved from the datasets, is used.
You can use the point= option (SAS(R) 9.4 Statements: Reference, Third Edition) to move that record pointer around in the dataset. A weird but potentially effective way to handle data merging, for example when the task is merging with a time window that is present in the ordered data to be combined.

Merging can be done with not-sorted data (SAS(R) 9.4 Statements: Reference, Third Edition; the BY statement has a NOTSORTED option).
Sorted data is well-ordered data, and ordered data gives the best results with merging. But ordered data does not need to have been sorted.
Sounds strange; the best example is a sorted dataset coming from another environment with a different encoding. That data is still nicely ordered, but could have issues with not being sorted in the current encoding.
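As a sketch of the NOTSORTED option (it is typically used with SET for BY-group processing of grouped-but-unsorted data; dataset and variable names below are made up):

```sas
/* count observations per region, with regions grouped but in no
   particular order; NOTSORTED only requires equal BY values to be
   contiguous, not ascending */
data region_counts;
   set work.events;
   by region notsorted;
   if first.region then n = 0;
   n + 1;
   if last.region then output;
   keep region n;
run;
```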

---->-- ja karman --<-----
gyambqt
Obsidian | Level 7

Hi Jaap,

Do you mean that in a DATA step merge the records of both datasets are loaded directly into the PDV for matching? I thought a record would be loaded into a buffer first, and then from the buffer into the PDV for matching.


Discussion stats
  • 25 replies
  • 7295 views
  • 1 like
  • 5 in conversation