Hello All,
I am currently working on a large data migration project that requires me to process very large datasets. I am looking for a way to optimise I/O efficiency because the processing is time sensitive, i.e. the client needs their systems to be offline for the least amount of time possible. The SAS server that I am using has the following base specifications: 6 CPUs with 128GB of RAM. On average the server's CPUs are around 90% idle and it uses about 25% of its RAM during daily use, meaning that approximately 75% of the RAM is underutilised.
I have started experimenting with the SASFILE statement and the COMPRESS option; however, I would still like to reduce I/O between data steps.
Does SAS automatically keep processed datasets in memory within a SAS program if the following data step requires the same data for processing? In other words, will SAS only write the final dataset to disk after the final data step, and if not, how could I implement this?
E.g.
data test1;
  set mydata.census;
run;

data test2;
  set mydata.census;
run;

proc summary data=mydata.census print;
run;

data mydata.census;
  modify mydata.census;
  /* ... statements to modify data ... */
run;
Any advice or tips would be greatly appreciated.
Accepted Solutions
Hi Matt,
It is indeed possible on a UNIX system, but it depends on your flavour of UNIX, and possibly your relationship with your administrator.
For Linux you would need to use tmpfs and ask your admin to create a tmpfs mount point for you. If you're on Linux you might have /dev/shm already mounted, but it will be restricted in size to half of the RAM available on your physical machine.
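A minimal sketch of how a tmpfs mount could be used from SAS, assuming a directory /dev/shm/sasram (a hypothetical path) already exists on the tmpfs mount and is writable by the SAS session. The output name census_final is also hypothetical. Anything stored on tmpfs lives in RAM and is lost on reboot, so only the final result is written to a disk-based library:

```sas
/* Assumption: /dev/shm/sasram exists on a tmpfs mount (hypothetical path) */
libname ramlib "/dev/shm/sasram";

/* Intermediate datasets are written to RAM-backed storage */
data ramlib.step1;
    set mydata.census;
    /* ... transformations ... */
run;

data ramlib.step2;
    set ramlib.step1;
    /* ... further transformations ... */
run;

/* Persist only the final result to disk (hypothetical output name) */
data mydata.census_final;
    set ramlib.step2;
run;

/* Free the memory held by the intermediate datasets */
proc datasets lib=ramlib kill nolist;
quit;
libname ramlib clear;
```

The intermediates count against the size limit of the tmpfs mount, so it is worth checking their combined size against that limit first.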
I wrote a paper on this for SAS Global Forum last year. You'll find the 'Repurposing Memory' section on page 6 most interesting, as it discusses the different approaches to storing data directly in memory on *nix.
Hope that helps.
Nik
Very interesting subject.
The more memory you assign to the SAS session (MEMSIZE), the more likely it is that SAS will use it as an internal swap. That said, I'm not convinced that it is used for data set output processing. I think that output data is written directly to disk, and not kept in RAM. But that might have to be verified by a SAS Institute SW developer...
SASFILE will actually store specified data into RAM. But in your example, it would be awkward, and will not give you any advantage. SASFILE is best suited for data sets referenced several times.
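For reference, a sketch of the SASFILE pattern for a dataset that is read several times, using the mydata.census dataset from the question. It assumes the whole dataset fits into the memory available to the session; whether it actually saves time depends on how often the data is re-read:

```sas
/* Open the dataset, allocate buffers and read it into memory once */
sasfile mydata.census load;

/* Each of these steps now reads from the in-memory copy */
data test1;
    set mydata.census;
run;

data test2;
    set mydata.census;
run;

proc summary data=mydata.census print;
run;

/* Release the buffers and close the file */
sasfile mydata.census close;
```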
COMPRESS only works on disk. So yes, it would in some cases reduce I/O, but you pay for it with CPU.
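A short sketch of the two usual ways to turn compression on, as a data set option or as a system option. The output names census_c and census_b are hypothetical:

```sas
/* Compress one output dataset (run-length encoding) */
data work.census_c (compress=yes);
    set mydata.census;
run;

/* Binary compression for one output dataset */
data work.census_b (compress=binary);
    set mydata.census;
run;

/* Or make compression the default for datasets created later in the session */
options compress=yes;
```

The SAS log reports by how much each compressed dataset shrank (or grew), which shows whether the CPU cost is buying any I/O reduction.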
The one thing that you could do is work with views whenever you can. There is then quite a high possibility that they are evaluated in memory (if there is enough). So if this is crucial, you may need to change from PROC SUMMARY to PROC SQL GROUP BY etc.
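A minimal sketch of the view approach, assuming the intermediate step is only needed as input to the next step. The view names v_census and v_census_sql are hypothetical:

```sas
/* DATA step view: stores the program, not the data */
data work.v_census / view=work.v_census;
    set mydata.census;
    /* ... derive columns or filter rows ... */
run;

/* Rows stream from the view into the procedure;
   no intermediate dataset is written to disk */
proc summary data=work.v_census print;
run;

/* The same idea as a PROC SQL view */
proc sql;
    create view work.v_census_sql as
        select *
        from mydata.census;
quit;
```

A view is re-executed every time it is read, so this pays off when the intermediate result is consumed once rather than several times.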
Thank you for your insights regarding MEMSIZE (I will investigate this further). It would be great to have some input from a SAS Institute SW developer on this topic.
If it is a massive project, and the data is critical, then you should be running a mirrored system, i.e. two identical systems, so that if one goes down the other effectively carries on. The secondary system should be located away from the main one to provide disaster recovery.
If you have that, then it shouldn't be a problem: you take the backup offline to upgrade it, then set it as the main system, and set the other to mirror the updates. Of course, that is once you have taken at least one full system restore point (probably to tape or something) and moved it offsite.
When you write data, the data is first written into the non-persistent file cache that the operating system maintains, and is subsequently flushed out to disk. The flush happens either periodically (to reduce probable data loss in case of system crash or power loss) or when the system runs out of file cache space.
If the system has enough RAM for file cache, a read of a dataset that was just written will occur mainly from the cache and will be quite fast.
Mind that this requires proper operating system configuration. What platform are you using for SAS?
The cache that you mention, is it held in RAM or on disk?
Platform information:
PROC PRODUCT_STATUS;
RUN;
For Base SAS Software ...
Custom version information: 9.3_M1
Image version information: 9.03.01M1P110211
Do you perhaps have any insight on reducing I/O requests?
The operating system's file cache is only kept in physical memory; anything else would be counterproductive.
By platform I meant the operating system of the SAS server.
I am using a unix server.
I'd be careful with the following but instead of using the SASFILE command, you can also assign a library which points to memory as documented here (link for Windows OS but similar info also available for UNIX/LINUX):
Hi,
You may want to investigate the Piping Feature of SAS/CONNECT (if you have that module licensed). The benefits of piping include:
- overlapped execution of proc and/or data step
- eliminate intermediate write to disk
- improved performance
- reduced disk space requirements
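A rough sketch of what piping can look like, assuming SAS/CONNECT is licensed and the server allows MP CONNECT sessions to be spawned on the same machine. The session names p1 and p2, the pipe name pipe1 and the dataset name stage1 are illustrative, and sashelp.class stands in for the real input:

```sas
/* Assumption: child sessions are started with the same command as this one */
options sascmd="!sascmd";

signon p1;
signon p2;

/* Writer: sends rows into the pipe instead of writing a dataset to disk */
rsubmit p1 wait=no;
    libname outlib sasesock ":pipe1";
    data outlib.stage1;
        set sashelp.class;
        /* ... transformations ... */
    run;
endrsubmit;

/* Reader: consumes rows from the same pipe as they arrive */
rsubmit p2 wait=no;
    libname inlib sasesock ":pipe1";
    proc summary data=inlib.stage1 print;
    run;
endrsubmit;

/* Wait for both sessions to finish, then shut them down */
waitfor _all_ p1 p2;
signoff _all_;
```

A pipe can only be read once and sequentially, so this suits a step whose output feeds exactly one downstream step.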
Hope this helps,
Ahmed
SAS Forum How do you persist data in memory (RAM)
between datasteps i.e. reduce I/O between datasteps?
inspired by
https://goo.gl/bVcEGd
https://communities.sas.com/t5/General-SAS-Programming/How-do-you-persist-data-in-memory-RAM-between-datasteps-i-e/m-p/326987
I do not know exactly how SAS implements DOSUBL, but I suspect
a DOSUBL runs in one virtual address space.
Here I am sharing a storage location among data steps.
I suspect they both live in the same address space,
so storage is less likely to be released or paged?
A lot depends on how SAS compiles DOSUBL.
Seems to me this has potential for sharing hashes?
HAVE
====
A common block or memory that I want to share with multiple datasteps
data parent;
%commonc(cartype $8,ACTION=INIT); /* same virtual address */
cartype='PARENT';
put cartype=;
...
data child1;
%commonc(cartype $8,ACTION=PUT); /* same virtual address */
put cartype=;
WANT cartype=PARENT in both data steps (even though not defined in child)
========================================================================
Cartype=PARENT
and again
Cartype=PARENT
Note you can change the cartype in the
child and it will appear in the parent
SOLUTION (Here I change cartype in the child and the change shows up in the parent)
====================================================================================
data _null_;
%commonc(cartype $8,action=INIT);
set sashelp.cars(obs=1);
cartype=make;
put cartype=; /* CARTYPE=Acura */
rc=dosubl('
data test1;
set class;
run;
data test2;
set class(obs=1);
cartype="HONDA";
%commonc(cartype $8,ACTION=PUT);
run;
proc means data=class;
run;
');
put cartype=; /* CARTYPE=HONDA */
run;quit;
Here is the commonc macro
%macro commonc(var,action=INIT);
* dosubl sets sysindex to 1;
* we are in dosubl if sysindex=1;
* increment sysindex so it is not 1 next time macro called;
%local varcut varlen;
%let varcut=%scan(&var,1);
%let varlen=%scan(&var,2);
%if %upcase(&action) = INIT %then %do;
length &var;
retain &varcut " ";
call symputx("varadr",put(addrlong(&varcut.),hex16.),"G");
put "***PARENT &var &varcut &varlen &SYSDATASTEPPHASE &sysindex";
%end;
%if "%upcase(&action)" = "PUT" %then %do;
length &var;
retain &varcut;
call pokelong(&varcut.,"&varadr."x, &varlen.);
%end;
%else %if "%upcase(&action)" = "GET" %then %do;
retain &varcut " ";
&varcut = peekclong("&varadr."x,&varlen.);
%end;
put "***CHILD &var &varcut &varlen &SYSDATASTEPPHASE &sysindex";
%mend commonc;
Have you looked at the MEMLIB option (Windows only) for libnames ?
A SAS option called MEMCACHE also exists for using and managing in-memory data, but I do not recommend using it as it is still immature.
I attach below a few excerpts from
https://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490
where the topic of memory-based data in SAS is covered.
MEMLIB
Set As: System Option At Startup, Libname Option (Windows only)
...
If we use this option within the LIBNAME statement, we can create a library in memory with the characteristics mentioned above: high speed with an associated risk of filling up available memory. The syntax is slightly odd because we will have to provide an existing physical path for the library, which will never be used.
libname RAMLIB "c:\" memlib;
...
- If you want to keep the data created in your RAM libraries, don’t forget to copy it to a permanent library before ending your SAS session.
- When you no longer need your library, make sure to free up the memory by deleting all the files, otherwise the data will stay in memory. This can be done from within SAS by running this program:
proc datasets lib=RAMLIB kill nolist;
quit;
libname RAMLIB clear;
MEMMAXSZ
Set As: System Option At Startup (Windows only)
This option specifies the maximum amount of memory to allocate for memory-based libraries. The memory allocated by MEMMAXSZ is outside of the REALMEMSIZE allocation.
MEMBLKSZ
Set As: System Option At Startup (Windows only)
This option sets the memory block size for RAM-based libraries. The value of the MEMBLKSZ system option defines the amount of memory that is initially allocated.
Additional memory can be allocated as needed in multiples of MEMBLKSZ up to the amount of memory that is specified by the MEMMAXSZ option.
...
Thank you for your detailed response. Do you perhaps know if this is possible in a Unix based SAS environment?