BookmarkSubscribeRSS Feed
gyambqt
Obsidian | Level 7

Hello experts?

I have seen people say Hash table can improve I/O process and do not understand why.

Assume you have two tables want to merge each table contain 50k records,  Table A and Table B . Also Input buffer can only take 10k records each time.

if you use data step merge then 50k+50k records will be loaded to input buffer for 10 times ((50k+50k)/10k=10 times)

50k+50k on hard disk---->Input buffer----->PDV---->etc.

if you use hash to hold table A, then it takes 5 times(50k/10k=5times) for data records on table A to load to input buffer then to HASH table. it will also take 5 times (50k/10k=5times) for data records in table B to be loaded to input buffer. so the total is 5+5=10 times. there is no difference....

50k records in table A on hard disk---->input buffer---->PDV----->HASH

50k records in table B on hard disk-----> input buffer-----PDV--->etc.

Please explain.

Thanks so much!!!!!

2 REPLIES 2
LinusH
Tourmaline | Level 20

Basically you are correct.  If you have a 1-1 merge by it will be practically the same i/o. But the real world isn't always that simple. For instance source data is not always sorted.

Data never sleeps
gyambqt
Obsidian | Level 7

Hi Linus,

do you mean if the source data is not sorted and you want to do data step merge. So the hard disk for data records has to be read several time from top to bottom to ensure every single records is read and matched from A with the every records in B. So it caused a lot of I/O operation. Like the following example:

assume the following dataset is not sorted.

Data A has 10k records

Data B has 10K records

SAS buffer has capacity 1k records

if we match A with B, so initially SAS buffer take 1k records from A and B, and do the matching at PDV (transfer record from buffer to PDV 1 by 1). If all the records in the SAS buffer for B has been used for matching, then buffer will remove previously selected records for B and take new portion of records from B to match with A. this process will be repeated until all the records from B are selected to match with first record in A. then after the first record for A is matched with all the records in B and it will move to the second record in A. This process will be repeated until all the records in A are selected for matching. So it resulted a lot of read input and output process.

Compare to sorted datasets.

not all the records will be selected by SAS buffer to match with every single record in A because it is sorted. So it resulted in less read input and output process.

Am I correct?

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 732 views
  • 0 likes
  • 2 in conversation