BookmarkSubscribeRSS Feed
gyambqt
Obsidian | Level 7

Hello experts?

I have seen people say Hash table can improve I/O process and do not understand why.

Assume you have two tables want to merge each table contain 50k records,  Table A and Table B . Also Input buffer can only take 10k records each time.

if you use data step merge then 50k+50k records will be loaded to input buffer for 10 times ((50k+50k)/10k=10 times)

50k+50k on hard disk---->Input buffer----->PDV---->etc.

if you use hash to hold table A, then it takes 5 times(50k/10k=5times) for data records on table A to load to input buffer then to HASH table. it will also take 5 times (50k/10k=5times) for data records in table B to be loaded to input buffer. so the total is 5+5=10 times. there is no difference....

50k records in table A on hard disk---->input buffer---->PDV----->HASH

50k records in table B on hard disk-----> input buffer-----PDV--->etc.

Please explain.

Thanks so much!!!!!

2 REPLIES 2
LinusH
Tourmaline | Level 20

Basically you are correct.  If you have a 1-1 merge by it will be practically the same i/o. But the real world isn't always that simple. For instance source data is not always sorted.

Data never sleeps
gyambqt
Obsidian | Level 7

Hi Linus,

do you mean if the source data is not sorted and you want to do data step merge. So the hard disk for data records has to be read several time from top to bottom to ensure every single records is read and matched from A with the every records in B. So it caused a lot of I/O operation. Like the following example:

assume the following dataset is not sorted.

Data A has 10k records

Data B has 10K records

SAS buffer has capacity 1k records

if we match A with B, so initially SAS buffer take 1k records from A and B, and do the matching at PDV (transfer record from buffer to PDV 1 by 1). If all the records in the SAS buffer for B has been used for matching, then buffer will remove previously selected records for B and take new portion of records from B to match with A. this process will be repeated until all the records from B are selected to match with first record in A. then after the first record for A is matched with all the records in B and it will move to the second record in A. This process will be repeated until all the records in A are selected for matching. So it resulted a lot of read input and output process.

Compare to sorted datasets.

not all the records will be selected by SAS buffer to match with every single record in A because it is sorted. So it resulted in less read input and output process.

Am I correct?

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 720 views
  • 0 likes
  • 2 in conversation