Hello,
I am going to work with very big SAS datasets in the next few days. The total size of the dataset is approximately 1.5 TB. It consists of the usual mail history from e-mail marketing. The dataset has six columns with the following structure:
WebID; UniqueID; LID; Start_Date; Email; error_type
4343; 4413216; 54435; 30AUG12:03:32:37; AMITNARANG786 at YAHOCOM; 75768
4342; 4314515; 66140; 10AUG12:04:32:20; ATRIZIA212 at LIBEROIT; 74475
1772; 0366771; 11713; 23DEC08:17:39:53; ALEXVRES at ONLINEDE; 36075
1390; 3106709; 45578; 10OCT11:10:11:29; ANITVOEGTLI at GMXDE; 10336
I would be thankful for suggestions on dealing with this dataset. Most of the time I will definitely be using basic Base SAS functionality, as follows:
1) "proc sort" : I have to sort big datasets multiple times by WebID, by UniqueID, by LID, by WebID UniqueID etc.
2) "proc mean": I have to use proc mean a number of times
3) "merge statement": I have to merge two or more datasets
...
etc.
In short, my question is: how to work efficiently with very big SAS datasets?
I will appreciate your help.
Regards,
Dr. Ehsan Ullah
Some physical questions:
1/ You are saying these are SAS datasets, not tables stored externally in an RDBMS?
1.5 TB is big, but not that big. Try to gather information on the physical constraints of the I/O system.
Spreading I/O processing over several physical devices will improve I/O time. Are there SSDs involved?
2/ Your system will have some number of CPUs and some amount of memory. You need to know what you have and how it can be used.
Spreading the load over CPUs will decrease total processing time, but may overload the system.
Using memory as much as possible instead of I/O will decrease processing time. Overloading memory, however, can cause thrashing, by which you lose any progress.
3/ SAS system options.
These can improve processing by optimizing block sizes (not too small; set them according to the OS settings, e.g. ALIGNSASIOFILES) and by minimizing conversion overhead.
Compression (COMPRESS=) can help by reducing I/O at the cost of CPU. Whether it pays off depends on the data structures and contents.
Within this area the goal is to optimize resource utilization; a sketch follows below.
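A minimal sketch of such option settings (the values are illustrative only and must be tuned to your own OS and hardware; note that MEMSIZE and ALIGNSASIOFILES can only be set at SAS start-up, in the config file or on the command line):

/* runtime-settable options; the values are examples, not recommendations */
options bufsize=64k    /* larger page size for big sequential reads  */
        bufno=10       /* more buffers per file, fewer I/O calls     */
        compress=yes   /* trades CPU for less I/O; test on your data */
        sortsize=2G;   /* memory PROC SORT may use before spilling   */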
Some logical/analytical approaches:
1/ Get rid of unnecessary data as quickly as possible and proceed with as small a dataset as possible.
2/ Minimize the number of steps in your processing. It is better to do everything in one run over the data than to access it over and over again.
For example, a SAS dataset can be split into multiple datasets in one run using a SAS DATA step (see the sketch after this list); using SQL, that needs several passes.
3/ Mind the effects of physical sequential I/O versus random-access I/O. The first one runs much faster (head movements of the hard disk).
A rule of thumb: when you are accessing more than about 20% of the data, process it sequentially; when less, random access will be faster.
When some datasets are frequently accessed randomly, define indexes on them (or use direct pointers).
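A minimal sketch of that one-pass split (the library name and the mapping of error_type codes to output tables are made up for illustration):

/* one sequential pass over the big table writes all subsets at once;
   with SQL, each subset would need its own pass over the data       */
data hard_bounce soft_bounce other;
   set maillib.history;                 /* the 1.5 TB source, read once */
   if      error_type = 75768 then output hard_bounce;
   else if error_type = 74475 then output soft_bounce;
   else                            output other;
run;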
Your ideas:
Sorting: defining indexes could eliminate some of it (keep the 20% rule of thumb in mind).
proc means: all analyses can be done in one sequential pass by using CLASS statements, instead of running the procedure multiple times (see the sketch after this list).
The disadvantage is memory usage; think of the MEMSIZE setting, etc.
merge: that is often the standard approach for table lookups. For table lookups, also think of the hash object and of using formats.
Those are CPU/memory oriented with no I/O constraints, and often do the job better than SQL-only thinking.
(and much more)
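A minimal sketch of that one-pass PROC MEANS (variable names are from the original post; the requested statistic is just an example):

/* CLASS does not need sorted input, so one sequential pass replaces
   several sort-then-BY runs; memory use grows with the number of
   class levels, so watch the MEMSIZE setting                        */
proc means data=maillib.history noprint nway;
   class WebID UniqueID LID;       /* all grouping variables at once */
   var   error_type;               /* stand-in analysis variable     */
   output out=stats n=n_mails;     /* mail count per combination     */
run;

If the separate groupings (by WebID alone, by UniqueID alone, etc.) are needed as well, a TYPES statement can request each of them from the same single pass instead of NWAY.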
Code designed for big-data processing will also work on small data. Test and validate your code with small sizes (a sample, or SET with OBS=); see the sketch below.
Code written just for small data does not necessarily run on big data (constraints, limitations). Designing your process and rethinking the performance aspects up front is the way to avoid pitfalls. How often do we hear: it ran on my sample but not on the real data.
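A minimal sketch of that small-scale testing (names are hypothetical):

/* develop and debug against a small slice first; the OBS= dataset
   option limits how many observations are read from the source    */
data work.dev_sample;
   set maillib.history (obs=100000);   /* first 100,000 rows only */
run;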
a) Get the fastest storage infrastructure available. If possible, try to cobble together a bunch of SSDs to form big enough volumes.
b) Set up a separate (big enough) volume to assign UTILLOC to (see the sketch after this list). That way the temporary files created during sorting do not land on the physical disks that your data already resides on. This reduces contention and the risk of out-of-space conditions.
c) For MERGEs, try to allocate separate physical volumes for the respective source and target datasets. This may be less important when using SSDs, because they have lower latency.
d) Use the DATA step instead of PROC SQL.
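Two minimal sketches for points b) and d). UTILLOC can only be set at SAS invocation (config file or command line), and all paths, libraries, and dataset names here are hypothetical:

/* sasv9.cfg: put utility (sort) files and WORK on their own volumes */
-UTILLOC "/ssdvol1/sasutil"
-WORK    "/ssdvol2/saswork"

And a sorted DATA step MERGE instead of a PROC SQL join:

proc sort data=in1.mails;  by UniqueID; run;
proc sort data=in2.errors; by UniqueID; run;

data out.combined;
   merge in1.mails  (in=a)
         in2.errors (in=b);
   by UniqueID;
   if a;                    /* keep all mail rows, like a left join */
run;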
1) Make indexes for these four BY combinations (WebID; UniqueID; LID; WebID UniqueID); see the sketch below.
2) A DATA step is better (than PROC SQL).
3) Use a hash table, if you have enough memory.
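Minimal sketches for points 1) and 3). A composite index on (WebID, UniqueID) also serves plain BY WebID, and the small lookup table with its campaign_name variable is purely hypothetical:

/* 1) indexes on the sort keys named in the question */
proc datasets library=maillib nolist;
   modify history;
   index create UniqueID LID;            /* two simple indexes        */
   index create comb = (WebID UniqueID); /* composite: WebID UniqueID */
quit;

/* 3) hash lookup: the small table is held in memory while the big
      table streams past it in one sequential pass                  */
data matched;
   if 0 then set maillib.campaigns;      /* defines lookup variables */
   if _n_ = 1 then do;
      declare hash h(dataset:'maillib.campaigns');
      h.defineKey('LID');
      h.defineData('campaign_name');     /* hypothetical variable */
      h.defineDone();
   end;
   set maillib.history;
   if h.find() = 0;                      /* keep only matching rows */
run;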
Xia Keshan
Dealing with SAS datasets/tables of such sizes requires special architectural and coding considerations and approaches:
From a System Architecture/Infrastructure point of view
From a SAS Storage Engine point of view
From a SAS Coding point of view
I think you have already received some wonderful and insightful responses. There is not much I can add here.
Good luck,
Ahmed
Since I am very new to SAS, most of these concepts are quite new to me. I am highly thankful to all of you for sharing these wonderful ideas.
@Jaap Karman: thanks a lot for such a long answer. You asked a few questions: yes, I have SAS datasets (which I read from .CSVs), and they are all on a single hard drive.
Let me start working on these ideas. In case of further (concrete) questions, I will write again.
A lot of thanks once again.