Kurt_Bremser
Super User

Forget your concern with the buffering, because you can NOT influence the caching that SAS does, apart from how much memory you grant it through MEMSIZE.

Any time SAS accesses a dataset, it tries to do some read-ahead etc. based on its own logic, but you cannot influence that.

What you can influence is the number of logical reads of a certain object; the hash table is a method to guarantee that a certain object is read only once from disk and then kept in memory, no matter how often and in what sequence you access it.

When you have a simple 1-to-1 or 1-to-n relationship based on a given key, sorting and merging is as simple as it can get. During the merge, SAS will go sequentially through the datasets and perform some read-ahead based on its own memory parameters; meanwhile, the OS will also read ahead into system memory according to its own educated guesses based on process behaviour. Since the OS knows about all processes currently running (your SAS process only knows about itself), it is best to leave most of the work to the OS by keeping SAS MEMSIZE small in a multiuser environment.
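
For illustration, a minimal sketch of that sort-and-merge pattern (MASTER, LOOKUP and the key ID are made-up names):

    proc sort data=master; by id; run;
    proc sort data=lookup; by id; run;

    data combined;
      merge master (in=m) lookup;
      by id;                  /* both datasets read sequentially, once */
      if m;                   /* keep only observations present in MASTER */
    run;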

When you need to deal with a complex and mostly unpredictable relationship (e.g. reading a start and end date and matching that against parameters for a given time range, like holidays - even in different regions!), then a hash table is a good method to prevent multiple random reads of the lookup table (or even multiple passes through the whole table), which would completely defeat all attempts by the OS to make intelligent decisions about what to cache and what not to for optimum system performance.
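
A sketch of such a lookup with the hash object (HOLIDAYS, EVENTS and their variables are made-up names; the point is that the lookup table is read from disk exactly once, when the hash is loaded):

    data events_flagged;
      if _n_ = 1 then do;
        if 0 then set holidays;               /* puts REGION and HOL_DT into the PDV */
        declare hash h (dataset:'holidays');  /* one sequential pass over the lookup */
        h.defineKey('region', 'hol_dt');
        h.defineData('hol_dt');
        h.defineDone();
      end;
      set events;                             /* has REGION, START_DT, END_DT */
      /* assumes HOL_DT, START_DT, END_DT are numeric SAS dates */
      n_holidays = 0;
      do d = start_dt to end_dt;
        if h.check(key:region, key:d) = 0 then n_holidays + 1;
      end;
      drop d;
    run;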

Randomized reads are also the primary cause for wait states in the storage subsystem, and bad overall performance.

gyambqt
Obsidian | Level 7

Thanks for your reply.

Indeed, I was confused by the following link:

http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m81/m81_2.htm

It said

  • SAS copies the data from the input data set to a buffer in memory
  • one observation at a time is loaded into the program data vector
  • each observation is written to an output buffer when processing is complete
  • the contents of the output buffer are written to the disk when the buffer is full
gyambqt
Obsidian | Level 7

Hi Kurt,

http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m81/m81_2.htm

http://web.utk.edu/sas/OnlineTutor/1.2/en/60477/m81/m81_3.htm

If you read the above links (they're very short), they say:

  • SAS copies the data from the input data set to a buffer in memory
  • one observation at a time is loaded into the program data vector
  • A buffer can be treated as a container in memory that is big enough for only one page of data.
  • A page is the unit of data transfer between the storage device and memory.
  • The amount of data that can be transferred to one buffer in a single I/O operation is referred to as page size. Page size is analogous to buffer size for SAS data sets.

I think the buffer here means the SAS buffer rather than the OS buffer (can we ignore the OS here?).

The size of the buffer can be adjusted in SAS; the bigger it is, the more data it can take in each time and, consequently, the less I/O.
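
For example, something like this is what I mean by adjusting it (WORK.MYDATA and WORK.SOURCE are made-up names, the values just examples):

    /* the page (buffer) size is fixed when the dataset is created */
    data work.mydata (bufsize=128k);
      set work.source;
    run;

    /* number of buffers SAS allocates per open dataset */
    options bufno=10;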

My question is: if the buffer size is small, for example it can only take 10k records each time, but you have 50k records to load into the hash table, it still takes 5 I/O operations to transfer the 50k records from the hard disk to the hash (memory).

I think that is a little bit different from your remark that only one I/O operation is required to load the 50k records from the hard disk to the hash.

My understanding of I/O is that it happens when data records from the hard disk are read by SAS into the SAS buffer. If the SAS buffer size is small, it will require more I/O operations to finish reading the complete dataset.

Is there a problem with my understanding?

Kurt_Bremser
Super User

You are completely missing the layer below SAS. The number of I/O operations is determined by the actual structure of the disk storage, like

- type of filesystem

- block size (in case of jfs2, 4096 bytes)

- organisation of partitions

and handled by the OS. SAS itself uses a page structure in its files and will read each page separately. SAS page size depends (IIRC) on the originating SAS version, file system parameters, and the record size of the SAS dataset.
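
You can see the page size and page count SAS actually uses for a given dataset with PROC CONTENTS (MYLIB.MYDATA is a placeholder):

    proc contents data=mylib.mydata;
    run;
    /* look for "Data Set Page Size" and "Number of Data Set Pages" in the output */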

In case you have a file that the OS will easily fit into its own cache, no disk operations will be necessary after the initial prefetch that the OS does for you. Still the hash object method will be more efficient because you avoid multiple random system calls (including the necessary context switches). And the hash method gives you the abilty to define and use a random key on unsorted data, something that is not available when reading from a dataset.
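
A minimal sketch of such a keyed lookup on unsorted data (CUSTOMERS, TRANSACTIONS and their variables are made up):

    data trans_named;
      if _n_ = 1 then do;
        if 0 then set customers (keep=cust_id name);
        declare hash c (dataset:'customers');   /* lookup table loaded once */
        c.defineKey('cust_id');
        c.defineData('name');
        c.defineDone();
      end;
      set transactions;                          /* arrives in no particular order */
      if c.find() ne 0 then call missing(name);  /* FIND uses CUST_ID from the PDV */
    run;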

Bottom line: to determine which method is best for a given problem, run a test of the alternative solutions in a real scenario and compare the outcomes. Some methods are better for single-user/single-tasking environments, others for multi-user/multi-tasking ones.

gyambqt
Obsidian | Level 7

Hi Kurt,

Could you briefly explain why a hash table only takes a single I/O operation?

data on disk----> memory (single operation?)

Thanks

Kurt_Bremser
Super User

It's not a single I/O operation, it is a SINGLE SEQUENTIAL READ OPERATION through the file. Each record is read exactly once from storage as it is placed into the hash, instead of being read repeatedly as with other methods.

jakarman
Barite | Level 11

Agree with Kurt. His remark about random access to the OS filesystem is an important one. The effect will diminish with SSDs, but with rotating hard disks those mechanical delays are the reason why a SQL approach can be slower than sorting and merging.
I have seen the turnaround point at roughly 10-20% of the data accessed: above that, partial (random) access loses and sequential processing performs faster. Remember that RDBMSs with SQL were designed for OLTP and not for big data processing. Only with massively parallel processing can SQL gain in speed/performance. That brings you to Parallel computing - Wikipedia, the free encyclopedia (Hadoop, Teradata, grid).


The PDV is only a SAS data step concept, and it comes into play AFTER all the data has come in.
The buffering of those records cannot be controlled easily, although you can do some things, as WHERE and KEEP processing is done on those records independently (dataset options). SAS(R) 9.4 Data Set Options: Reference, Second Edition

Just look at those and realize they operate outside the PDV's scope. http://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/TASS/Mehatab-DataStepPDV.pdf is a nice presentation that shows the buffer as the interaction with your OS, with all your data step processing using the PDV. The WHERE statement is confusing, as it works outside the PDV and executes before it. This is different from IF and select/where, which process all data seen by the PDV. For performance it is best that unneeded incoming data is never seen by the PDV. When optimizing outgoing data, you can drop/keep variables before they are written out.
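
A small sketch of that difference (BIG and its variables are made-up names):

    /* WHERE= filters while the pages are read, before the PDV */
    data subset_where;
      set big (where=(region = 'EU'));
    run;

    /* a subsetting IF sees the observation only after it is already in the PDV */
    data subset_if (keep=id amount);
      set big;
      if region = 'EU';
    run;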

---->-- ja karman --<-----
gyambqt
Obsidian | Level 7

I have read the PowerPoint about the PDV; the buffer still exists, but according to Kurt's remark there is no good way to control how it reads the data records from the hard disk.

I have read other people's remarks that, for 1-to-1 matching, there is no difference between a data step merge and a hash table match in terms of I/O.

Since the datasets are accessed sequentially in a data step merge, shouldn't it cause more I/O than a hash table match?

jakarman
Barite | Level 11

The data step MERGE with BY is NOT accessing the data randomly but does it sequentially. Sequential access is very predictable for the OS (buffering).
The overhead is caused by getting the data ordered first (the sorting). The intermediate approach is using indexes: the index is sorted, but the data that is retrieved isn't. The worst approach would be a sequential search from the start for every record.

The most efficient one is the "balance line" algorithm; that is why it is important and can be found in RDBMSs (explain paths etc.).

  

Typical indexed access will cause a random I/O pattern (normal with SQL). This is the same as using POINT= or indexed BY access with a SAS data step.

The hash will avoid all that I/O, but you can still do a "balance line", an index lookup, or a restart from the first obs with it. That is your choice when using hashes.
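
A sketch of that indexed (random I/O) pattern in a data step, assuming a simple index on the made-up key ID (DRIVER, LOOKUP and LOOKUP_VALUE are placeholders):

    proc datasets library=work nolist;
      modify lookup;
      index create id;
    quit;

    data matched;
      set driver;                    /* each DRIVER row supplies an ID */
      set lookup key=id / unique;    /* indexed lookup: potentially random I/O */
      if _iorc_ ne 0 then do;        /* key not found */
        _error_ = 0;
        call missing(lookup_value);  /* LOOKUP_VALUE is a placeholder variable */
      end;
    run;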

---->-- ja karman --<-----
gyambqt
Obsidian | Level 7

Thx Jaap.

jakarman
Barite | Level 11

Hi, I do not think SAS is using a unique hash solution; the generic approach is described at Hash table - Wikipedia, the free encyclopedia.
The approach of using buckets/hashes is the same as optimizing the spread in a DBMS environment. It is the same logical question, just a different area where it is applied. Cannot let it go: the SQL hash spreading the data is an application of the bucket/hash algorithm. The hash is used in a DBMS file structure to place records in some preferred page areas (like buckets).

It avoids data movement within memory. That, too, can get costly: imagine that removing/adding one record of data forced all the others to be moved around in memory. That could need a lot of CPU instructions. Avoiding those will speed things up (decrease time).

The details of the implemented hashing technique are less relevant. It can be changed and optimized by SAS (if you use that one) over time.
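
The one hashing detail that is exposed is the bucket count, via the HASHEXP argument; a minimal sketch (LOOKUP, ID and VALUE are made-up names):

    data _null_;
      if 0 then set lookup;                            /* ID and VALUE into the PDV */
      declare hash h (dataset:'lookup', hashexp:16);   /* 2**16 = 65536 buckets */
      h.defineKey('id');
      h.defineData('value');
      h.defineDone();
      n = h.num_items;
      put 'Rows loaded into the hash: ' n;
      stop;
    run;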

---->-- ja karman --<-----
