Hello,
I am going to work with very big SAS datasets in the next few days. The total size of the dataset is approximately 1.5 TB. It consists of the usual mail history from e-mail marketing. The dataset has six columns with the following structure:
WebID; UniqueID; LID; Start_Date; Email; error_type
4343; 4413216; 54435; 30AUG12:03:32:37; AMITNARANG786 at YAHOCOM; 75768
4342; 4314515; 66140; 10AUG12:04:32:20; ATRIZIA212 at LIBEROIT; 74475
1772; 0366771; 11713; 23DEC08:17:39:53; ALEXVRES at ONLINEDE; 36075
1390; 3106709; 45578; 10OCT11:10:11:29; ANITVOEGTLI at GMXDE; 10336
I would be thankful for suggestions on dealing with this dataset. Most of the time I will definitely be using basic Base SAS functionality, as follows:
1) "proc sort" : I have to sort big datasets multiple times by WebID, by UniqueID, by LID, by WebID UniqueID etc.
2) "proc mean": I have to use proc mean a number of times
3) "merge statement": I have to merge two or more datasets
...
etc.
In short, my question is: how to work efficiently with very big SAS datasets?
I will appreciate your help.
Regards,
Dr. Ehsan Ullah
Some physical questions:
1/ You are saying these are SAS datasets, not tables stored externally in an RDBMS?
1.5 TB is big, but not that big. Try to gather information on the physical constraints of the I/O system.
Spreading I/O processing over several physical devices will improve I/O time. Are there SSDs involved?
2/ Your system will have some number of CPUs and some amount of memory. You need to know what you have and how it can be used.
Spreading the load over CPUs will decrease total processing time, but may overload the system.
Using memory as much as possible instead of I/O will decrease processing time. Overloading memory, however, can cause thrashing, by which you lose any progress.
3/ SAS system options.
These can improve processing by optimizing block sizes (not too small; set them according to the OS settings, e.g. ALIGNSASIOFILES) and by minimizing conversion overhead.
Compression (COMPRESS=) can help by reducing I/O at the cost of CPU. Whether it pays off depends on the data structures and contents.
Within this area the goal is to optimize resource utilization; a sketch follows below.
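A minimal sketch of such option settings (the values are illustrative only and must be tuned to your own OS and hardware; note that MEMSIZE and ALIGNSASIOFILES can only be set at SAS start-up, in the config file or on the command line):

/* runtime-settable options; the values are examples, not recommendations */
options bufsize=64k    /* larger page size for big sequential reads  */
        bufno=10       /* more buffers per file, fewer I/O calls     */
        compress=yes   /* trades CPU for less I/O; test on your data */
        sortsize=2G;   /* memory PROC SORT may use before spilling   */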
Some logical/analytical approaches:
1/ Get rid of unnecessary data as quickly as possible and proceed with as small a dataset as possible.
2/ Minimize the number of steps in your processing. It is better to do everything in one run over the data than to access it over and over again.
For example, a SAS dataset can be split into multiple datasets in one run using a SAS DATA step (see the sketch after this list); using SQL, that needs several passes.
3/ Mind the effects of physical sequential I/O versus random-access I/O. The first one runs much faster (head movements of the hard disk).
A rule of thumb: when you are accessing more than about 20% of the data, process it sequentially; when less, random access will be faster.
When some datasets are frequently accessed randomly, define indexes on them (or use direct pointers).
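A minimal sketch of that one-pass split (the library name and the mapping of error_type codes to output tables are made up for illustration):

/* one sequential pass over the big table writes all subsets at once;
   with SQL, each subset would need its own pass over the data       */
data hard_bounce soft_bounce other;
   set maillib.history;                 /* the 1.5 TB source, read once */
   if      error_type = 75768 then output hard_bounce;
   else if error_type = 74475 then output soft_bounce;
   else                            output other;
run;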
Your ideas:
Sorting: defining indexes could eliminate some of it (keep the 20% rule of thumb in mind).
proc means: all analyses can be done in one sequential pass by using CLASS statements, instead of running the procedure multiple times (see the sketch after this list).
The disadvantage is memory usage; think of the MEMSIZE setting, etc.
merge: that is often the standard approach for table lookups. For table lookups, also think of the hash object and of using formats.
Those are CPU/memory oriented with no I/O constraints, and often do the job better than SQL-only thinking.
(and much more)
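A minimal sketch of that one-pass PROC MEANS (variable names are from the original post; the requested statistic is just an example):

/* CLASS does not need sorted input, so one sequential pass replaces
   several sort-then-BY runs; memory use grows with the number of
   class levels, so watch the MEMSIZE setting                        */
proc means data=maillib.history noprint nway;
   class WebID UniqueID LID;       /* all grouping variables at once */
   var   error_type;               /* stand-in analysis variable     */
   output out=stats n=n_mails;     /* mail count per combination     */
run;

If the separate groupings (by WebID alone, by UniqueID alone, etc.) are needed as well, a TYPES statement can request each of them from the same single pass instead of NWAY.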
Code designed for big-data processing will also work on small data. Test and validate your code with small sizes (a sample, or SET with OBS=); see the sketch below.
Code written just for small data does not necessarily run on big data (constraints, limitations). Designing your process and rethinking the performance aspects up front is the way to avoid pitfalls. How often do we hear: it ran on my sample but not on the real data.
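A minimal sketch of that small-scale testing (names are hypothetical):

/* develop and debug against a small slice first; the OBS= dataset
   option limits how many observations are read from the source    */
data work.dev_sample;
   set maillib.history (obs=100000);   /* first 100,000 rows only */
run;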
a) Get the fastest storage infrastructure available. If possible, try to cobble together a bunch of SSDs to form big enough volumes.
b) Set up a separate (big enough) volume to assign UTILLOC to (see the sketch after this list). That way the temporary files created during sorting do not land on the physical disks that your data already resides on. This reduces contention and the risk of out-of-space conditions.
c) For MERGEs, try to allocate separate physical volumes for the respective source and target datasets. This may be less important when using SSDs, because they have lower latency.
d) Use the DATA step instead of PROC SQL.
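Two minimal sketches for points b) and d). UTILLOC can only be set at SAS invocation (config file or command line), and all paths, libraries, and dataset names here are hypothetical:

/* sasv9.cfg: put utility (sort) files and WORK on their own volumes */
-UTILLOC "/ssdvol1/sasutil"
-WORK    "/ssdvol2/saswork"

And a sorted DATA step MERGE instead of a PROC SQL join:

proc sort data=in1.mails;  by UniqueID; run;
proc sort data=in2.errors; by UniqueID; run;

data out.combined;
   merge in1.mails  (in=a)
         in2.errors (in=b);
   by UniqueID;
   if a;                    /* keep all mail rows, like a left join */
run;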
1) Make indexes for these four BY combinations (WebID; UniqueID; LID; WebID UniqueID); see the sketch below.
2) A DATA step is better (than PROC SQL).
3) Use a hash table, if you have enough memory.
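Minimal sketches for points 1) and 3). A composite index on (WebID, UniqueID) also serves plain BY WebID, and the small lookup table with its campaign_name variable is purely hypothetical:

/* 1) indexes on the sort keys named in the question */
proc datasets library=maillib nolist;
   modify history;
   index create UniqueID LID;            /* two simple indexes        */
   index create comb = (WebID UniqueID); /* composite: WebID UniqueID */
quit;

/* 3) hash lookup: the small table is held in memory while the big
      table streams past it in one sequential pass                  */
data matched;
   if 0 then set maillib.campaigns;      /* defines lookup variables */
   if _n_ = 1 then do;
      declare hash h(dataset:'maillib.campaigns');
      h.defineKey('LID');
      h.defineData('campaign_name');     /* hypothetical variable */
      h.defineDone();
   end;
   set maillib.history;
   if h.find() = 0;                      /* keep only matching rows */
run;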
Xia Keshan
Dealing with SAS datasets/tables of such sizes requires special architectural and coding considerations and approaches:
From a System Architecture/Infrastructure point of view
From a SAS Storage Engine point of view
From a SAS Coding point of view
I think you have already received some wonderful and insightful responses. There is not much I can add here.
Good luck,
Ahmed
Since I am very new to SAS, most of these concepts are quite new to me. I am highly thankful to all of you for sharing these wonderful ideas.
@Jaap Karman: thanks a lot for such a long answer. You asked a few questions: yes, I have SAS datasets (which I read from .CSVs), and they are all on a single hard drive.
Let me start working on these ideas. In case of further (concrete) questions, I will write again.
A lot of thanks once again.