04-01-2015 01:37 AM
Based on page 9 of the following link, it says a data step merge is "Not memory intensive because only the current observation from each input dataset is loaded into memory".
I was a bit confused by this statement, as I thought the data sets are first loaded (possibly many records at a time) into a buffer (memory) before observations are loaded into the PDV one by one. So if the buffer size is large enough, entire SAS data sets could be held there. Therefore I think a data step merge can sometimes be quite memory intensive as well.
Please let me know why I am wrong.
04-01-2015 02:52 AM
SAS uses the memory allowed by the MEMSIZE system option (set from the command line or config file) to cache file data if the memory is not otherwise needed. Since any useful operating system already does that on its own, it is a SAS recommendation to keep MEMSIZE rather small in multiuser environments, so the OS can make the most intelligent decisions about which data needs to be kept in cache memory.
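As a quick check, you can display the current MEMSIZE setting; the 2G value shown in the comment is only an illustrative example, not a recommendation:

```sas
/* Write the current MEMSIZE setting to the log */
proc options option=memsize value;
run;

/* MEMSIZE can only be set at SAS startup, e.g. on the command line:
     sas -memsize 2G
   or in the configuration file (sasv9.cfg):
     -MEMSIZE 2G
*/
```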
The data step itself will never be memory intensive beyond the size of one record per input data set. SAS will only consume additional memory to preload data if you allow it via the respective options, but that is true in general and not specific to the data step.
In a situation where you have data that is accessed regularly and repeatedly, the OS will keep that in "persistent" memory all the time for you.
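A minimal sketch of a match-merge illustrates the point: however large the inputs are, only the current observation from each data set sits in the PDV at any moment. The data set names and values below are made up for illustration:

```sas
data a;
  input id x;
datalines;
1 10
2 20
3 30
;
data b;
  input id y;
datalines;
2 200
3 300
4 400
;
proc sort data=a; by id; run;
proc sort data=b; by id; run;

/* One sequential pass over each sorted input: the merge reads the
   next observation from whichever stream is needed for the current
   BY value, so memory use stays at one record per data set. */
data ab;
  merge a(in=ina) b(in=inb);
  by id;
run;
```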
04-01-2015 03:05 AM
You are not wrong; buffering and memory usage are confusing because several layers are involved and need to be understood.
From low level to high level:
1- The device (SSD, NAS) does buffering for file I/O (data in transit, shared by many OS machines)
2- The OS does buffering for file I/O (shared among the processes running on it)
3- SAS does buffering for file I/O (within your own process's memory)
Going up through these steps, the buffering becomes more dedicated to your process and can be more specific about the needs of the logical processing.
Within SAS, only the data step uses the PDV; procedures like SQL and REPORT do not have that.
Old logical solutions like a sorted join (the Balance Line Algorithm) are still the principal approach, often hidden inside SQL DBMS systems.
They come from an era when data was big compared to the amount of available memory. ("Big data" is much hyped these days as very modern.) The trade-off between using available memory and choosing the best logical approach is an everlasting, returning question.
1- NAS/SAN behavior is changing with hardware evolution. Total storage size is increasing faster than the speed of the individual parts.
The shift to SSDs is changing some techniques that used to be important.
2- OS buffering should be isolated and given a maximum. Common failures with SAS on Unix are:
a/ assuming everything to be unlimited.
b/ the OS segregating loadables from data, while many SAS loadables (e.g. catalog SAS files) are stored as data.
3- With SAS you can tune BUFSIZE and BUFNO, use SASFILE (load a file into memory), design stream processing (using hash objects), or use formats.
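Sketches of those three techniques, assuming hypothetical WORK data sets (work.big, work.lookup with variables id and value):

```sas
/* 1. Tune buffers per data set: a larger BUFSIZE and more BUFNO
      buffers trade memory for fewer physical I/O operations. */
data work.out (bufsize=64k);
  set work.big (bufno=10);
run;

/* 2. SASFILE pins a whole data set in memory for repeated reads. */
sasfile work.lookup load;
/* ... several steps that read work.lookup ... */
sasfile work.lookup close;

/* 3. A hash object streams the big table once while holding only
      the smaller lookup table in memory. */
data matched;
  if _n_ = 1 then do;
    if 0 then set work.lookup;          /* define PDV host variables */
    declare hash h(dataset: "work.lookup");
    h.defineKey("id");
    h.defineData("value");
    h.defineDone();
  end;
  set work.big;
  if h.find() = 0;                      /* keep matching ids only */
run;
```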
All of this concerns internal memory, this resource: DIMM - Wikipedia, the free encyclopedia (close to the processing units).
Do not confuse it with external storage, this resource: Hard disk drive - Wikipedia, the free encyclopedia (far from the processing units).
Data step merges rely more on hard disk storage and less on RAM, just as SQL DBMS systems do, since both use the same approach.