DATA Step, Macro, Functions and more

Can hash improve I/O process?

Reply
Regular Contributor
Posts: 152

Can hash improve I/O process?

Hello experts?

I have seen people say Hash table can improve I/O process and do not understand why.

Assume you have two tables want to merge each table contain 50k records,  Table A and Table B . Also Input buffer can only take 10k records each time.

if you use data step merge then 50k+50k records will be loaded to input buffer for 10 times ((50k+50k)/10k=10 times)

50k+50k on hard disk---->Input buffer----->PDV---->etc.

if you use hash to hold table A, then it takes 5 times(50k/10k=5times) for data records on table A to load to input buffer then to HASH table. it will also take 5 times (50k/10k=5times) for data records in table B to be loaded to input buffer. so the total is 5+5=10 times. there is no difference....

50k records in table A on hard disk---->input buffer---->PDV----->HASH

50k records in table B on hard disk-----> input buffer-----PDV--->etc.

Please explain.

Thanks so much!!!!!

Super User
Posts: 5,429

Re: Can hash improve I/O process?

Basically you are correct.  If you have a 1-1 merge by it will be practically the same i/o. But the real world isn't always that simple. For instance source data is not always sorted.

Data never sleeps
Regular Contributor
Posts: 152

Re: Can hash improve I/O process?

Hi Linus,

do you mean if the source data is not sorted and you want to do data step merge. So the hard disk for data records has to be read several time from top to bottom to ensure every single records is read and matched from A with the every records in B. So it caused a lot of I/O operation. Like the following example:

assume the following dataset is not sorted.

Data A has 10k records

Data B has 10K records

SAS buffer has capacity 1k records

if we match A with B, so initially SAS buffer take 1k records from A and B, and do the matching at PDV (transfer record from buffer to PDV 1 by 1). If all the records in the SAS buffer for B has been used for matching, then buffer will remove previously selected records for B and take new portion of records from B to match with A. this process will be repeated until all the records from B are selected to match with first record in A. then after the first record for A is matched with all the records in B and it will move to the second record in A. This process will be repeated until all the records in A are selected for matching. So it resulted a lot of read input and output process.

Compare to sorted datasets.

not all the records will be selected by SAS buffer to match with every single record in A because it is sorted. So it resulted in less read input and output process.

Am I correct?

Ask a Question
Discussion stats
  • 2 replies
  • 203 views
  • 0 likes
  • 2 in conversation