04-01-2015 12:01 AM
I have seen people say a hash table can improve the I/O process, and I do not understand why.
Assume you have two tables you want to merge, Table A and Table B, each containing 50k records. Also assume the input buffer can only take 10k records at a time.
If you use a DATA step merge, then the 50k+50k records will be loaded into the input buffer in 10 passes ((50k+50k)/10k = 10 times):
50k+50k records on hard disk ----> input buffer ----> PDV ----> etc.
If you use a hash object to hold Table A, then it takes 5 passes (50k/10k = 5 times) for the records of Table A to go through the input buffer into the hash table. It also takes 5 passes (50k/10k = 5 times) for the records of Table B to be loaded into the input buffer. So the total is 5+5 = 10 passes. There is no difference...
50k records in table A on hard disk ----> input buffer ----> PDV ----> hash table
50k records in table B on hard disk ----> input buffer ----> PDV ----> etc.
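To make the pass counting concrete, here is a back-of-the-envelope model (in Python, since it is just arithmetic; this is an illustration of the buffer math above, not actual SAS internals, and the buffer/record sizes are the assumed numbers from this example):

```python
# Hypothetical I/O model: how many times must a 10k-record input buffer
# be filled under each strategy? (Illustration only, not SAS internals.)

RECORDS_A = 50_000
RECORDS_B = 50_000
BUFFER = 10_000

def buffer_loads(n_records, buffer_size):
    """Number of buffer fills needed to stream n_records through the buffer."""
    return -(-n_records // buffer_size)  # ceiling division

# DATA step merge: both (sorted) tables stream through the buffer once.
merge_loads = buffer_loads(RECORDS_A + RECORDS_B, BUFFER)

# Hash lookup: table A streams once into the hash object,
# then table B streams once for the lookups.
hash_loads = buffer_loads(RECORDS_A, BUFFER) + buffer_loads(RECORDS_B, BUFFER)

print(merge_loads)  # 10
print(hash_loads)   # 10  (5 + 5)
```

Both strategies come out to 10 buffer fills in this sorted, 1-to-1 scenario, which is exactly the poster's point.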
Thanks so much!!!!!
04-01-2015 02:13 AM
Basically you are correct. If you have a 1-to-1 merge with a BY statement, the I/O will be practically the same. But the real world isn't always that simple. For instance, source data is not always sorted.
04-01-2015 07:11 PM
Do you mean that if the source data is not sorted and you want to do a DATA step merge, the data records on the hard disk have to be read several times from top to bottom to ensure every single record in A is read and matched against every record in B, and that this causes a lot of I/O operations? Like the following example:
Assume the following datasets are not sorted:
Data A has 10k records
Data B has 10k records
The SAS buffer has a capacity of 1k records
If we match A with B, initially the SAS buffer takes 1k records from each of A and B and does the matching in the PDV (records are transferred from the buffer to the PDV one by one). Once all the records in the buffer for B have been used for matching, the buffer discards them and takes a new portion of records from B to match against A. This is repeated until every record of B has been read to match against the first record of A. Only after the first record of A has been compared with all the records of B does the process move to the second record of A. This continues until every record of A has been matched, so B is re-read from disk over and over, resulting in a lot of read I/O.
Compare that to sorted datasets: because the data is sorted, not every record of B has to be brought into the SAS buffer to match each record of A, so there is far less read I/O.
Am I correct?
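If it helps, the re-reading described above can be counted with the same kind of rough model as before (a Python illustration of the buffer-fill arithmetic under the stated assumptions; a simplified block-at-a-time version of the nested matching, not actual SAS internals):

```python
# Hypothetical I/O model: buffer fills for unsorted nested matching
# vs a sorted merge. (Illustration only, not SAS internals.)

RECORDS_A = 10_000
RECORDS_B = 10_000
BUFFER = 1_000

def fills(n_records, buffer_size):
    """Number of buffer fills needed to stream n_records through the buffer."""
    return -(-n_records // buffer_size)  # ceiling division

# Unsorted nested matching: for every buffer-full block of A,
# all of B must be re-read through the buffer.
blocks_a = fills(RECORDS_A, BUFFER)                       # 10 blocks of A
nested_fills = blocks_a + blocks_a * fills(RECORDS_B, BUFFER)  # 10 + 10*10

# Sorted merge: each table streams through the buffer exactly once.
sorted_fills = fills(RECORDS_A, BUFFER) + fills(RECORDS_B, BUFFER)  # 10 + 10

print(nested_fills)  # 110
print(sorted_fills)  # 20
```

Under this toy model the unsorted nested matching costs 110 buffer fills versus 20 for the sorted merge, which matches the intuition in the post that re-reading B is where the extra I/O comes from.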