🔒 This topic is solved and locked.
ehsanmath
Obsidian | Level 7

Hello,

I am going to work with very big SAS datasets in the coming days. The total size of the data is approximately 1.5 TB. The dataset consists of the usual mailing history from email marketing. There are six columns in the dataset, and it has the following structure.

WebID;  UniqueID;  LID;    Start_Date;        Email;                    error_type
4343;   4413216;   54435;  30AUG12:03:32:37;  AMITNARANG786@YAHOO.COM;  75768
4342;   4314515;   66140;  10AUG12:04:32:20;  ATRIZIA212@LIBERO.IT;     74475
1772;   0366771;   11713;  23DEC08:17:39:53;  ALEXVRES@ONLINE.DE;       36075
1390;   3106709;   45578;  10OCT11:10:11:29;  ANITVOEGTLI@GMX.DE;       10336
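For reference, a data step that would read such a semicolon-delimited file (the path, informats, and lengths are assumptions based on the sample above):

data work.mail_hist;
    infile "/data/mail_hist.csv" dlm=";" dsd firstobs=2;  /* hypothetical path; skip header row */
    input WebID UniqueID LID Start_Date :datetime16. Email :$50. error_type;
    format Start_Date datetime16.;
run;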

I would be thankful for suggestions on dealing with this dataset. Most of the time I will be using basic Base SAS functionality, as follows:

1) "proc sort" : I have to sort big datasets multiple times by WebID, by UniqueID, by LID, by WebID UniqueID etc.

2) "proc mean": I have to use proc mean a number of times

3) "merge statement": I have to merge two or more datasets

...

etc.

In short, my question is: how do I work efficiently with very big SAS datasets?

I would appreciate your help.

regards

Dr. Ehsan Ullah


5 REPLIES
jakarman
Barite | Level 11 (Accepted Solution)

Some physical questions:

1/ You are saying these are SAS datasets, not tables stored externally in an RDBMS?

   1.5 TB is big, but not that big; try to gather information on the physical constraints of the I/O subsystem.

   Spreading the I/O processing over several physical devices will reduce I/O time. Are there SSDs involved?

2/ Your system will have a number of CPUs and a certain amount of memory. You will need to know what you have and how it can be used.

   Spreading the load over CPUs will decrease total processing time, but can possibly overload the system.

   Using memory as much as possible instead of I/O will decrease processing time. Overloading memory can cause thrashing, by which you lose any progress.

3/ SAS system options.

   These can improve processing by optimizing block sizes (not too small); set them according to the OS settings (ALIGNSASIOFILES) to minimize conversion overhead.

   Compression (COMPRESS=) can help, minimizing I/O at the cost of CPU. The benefit depends on the data structures and contents.

Within this area the goal is to optimize resource utilization.
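As a minimal sketch of the COMPRESS= and buffer options mentioned above (the librefs and dataset names are hypothetical):

/* Session-wide defaults; COMPRESS=BINARY suits mostly numeric rows. */
options compress=binary bufsize=128k;

/* Per-dataset overrides are also possible. */
data big.mail_hist (compress=yes bufno=10);
    set raw.mail_hist;
run;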

Some logical analytics approaches:

1/ get rid of unnecessary data as quickly as possible and proceed with as small a dataset as possible

2/ minimize the number of passes in your processing. It is better to run over the data once than to access it over and over again.

    For example, a SAS dataset can be split into multiple datasets in one run using a SAS data step; with SQL that needs several passes (see the sketch after this list).

3/ The effects of physical sequential I/O versus random-access I/O: the first will run much faster (head movements of a hard disk).

    A rule of thumb: when you are accessing more than 20% of the data, process it sequentially; when less, random access will be faster.

    When some datasets are accessed randomly more often, define indexes on them (or direct pointers).
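A minimal sketch of the one-pass split from point 2/ above (the libref maillib is hypothetical; error_type comes from the question, and the split condition is purely illustrative):

/* One sequential pass writes both output datasets at once; */
/* SQL would need one full pass per target table.           */
data work.bounced work.delivered;
    set maillib.mail_hist;
    if error_type > 0 then output work.bounced;
    else output work.delivered;
run;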

Your ideas:

Sorting... defining indexes could eliminate that (keep the 20% rule of thumb in mind).
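A sketch of creating such indexes with PROC DATASETS (the libref is hypothetical; the variables are the BY variables from the question):

proc datasets library=maillib nolist;
    modify mail_hist;
    index create WebID;
    index create UniqueID;
    index create LID;
    index create web_uid = (WebID UniqueID);  /* composite index for BY WebID UniqueID */
quit;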

proc means ... can do all analyses in one sequential pass using a CLASS statement, instead of running multiple times over pre-sorted data.

    The disadvantage is memory usage; think about the MEMSIZE setting etc.
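A sketch of such a one-pass summary (variable names are taken from the question; treating error_type as an analysis variable is only illustrative):

proc means data=maillib.mail_hist noprint nway;
    class WebID LID;                      /* CLASS replaces PROC SORT + BY */
    var error_type;
    output out=work.mail_stats n= mean= / autoname;
run;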

merge ... that is often the standard approach for table lookups. For table lookups, also think about the hash object and formats.

    Those are CPU/memory oriented, with no I/O constraints, and often do the job better than SQL-only thinking.
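A minimal sketch of a hash-object lookup replacing a sort/merge (the lookup table web_lookup and the variable web_name are hypothetical):

data work.enriched;
    length web_name $40;                                /* hypothetical lookup value    */
    if _n_ = 1 then do;
        declare hash h(dataset: "maillib.web_lookup");  /* small table loaded in memory */
        h.defineKey("WebID");
        h.defineData("web_name");
        h.defineDone();
        call missing(web_name);
    end;
    set maillib.mail_hist;                              /* big table needs no sorting   */
    if h.find() ne 0 then web_name = "";                /* no match found               */
run;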

(much more)

Code designed for big-data processing will also work on small data. Test and validate your code on small sizes (a sample, or SET with OBS=).

Code written just for small data does not need to be able to run on big data (constraints, limitations). Designing your process and rethinking the performance aspects is the way to avoid pitfalls. How often do we hear: "it ran on my sample, but not on the real data"?

---->-- ja karman --<-----
Kurt_Bremser
Super User

a) Get the fastest storage infrastructure available. If possible, try to cobble together a bunch of SSDs to form big enough volumes.

b) Set up a separate (big enough) volume and assign UTILLOC to it. That way the temporary utility files created during sorting are not written to the physical disks where your data already resides. This reduces contention and the risk of out-of-space conditions (see the sketch after this list).

c) For MERGEs, try to allocate separate physical volumes for the respective source and target data sets. This may be less important when using SSDs, because they have lower latency.

d) Use the DATA step instead of PROC SQL.
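UTILLOC can only be set at SAS invocation, for example in the configuration file; a sketch assuming hypothetical paths on separate volumes:

/* sasv9.cfg: utility (sort) files and WORK on different volumes */
-UTILLOC "/fastvol/sasutil"
-WORK    "/workvol/saswork"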

Ksharp
Super User

1) Create an index for each of those four BY variables/combinations (see the PROC DATASETS sketch above).

2) A DATA step is better (than PROC SQL).

3) Use a hash table, if you have enough memory.

Xia Keshan

AhmedAl_Attar
Ammonite | Level 13

Dealing with SAS data sets/tables of such sizes requires special architectural and coding considerations and approaches.

From a System Architecture/Infrastructure point of view:

  • The I/O subsystem and its sustained throughput is something you'll need to pay special attention to. I'm not sure what your level of influence within your organization is, or what stage of the project you are at, but if you want to ensure efficient utilization and acceptable run times, then you need to get the right storage specifications in place.
  • The correct ratio of GB of RAM per CPU/CPU core (4, 6, or 8 GB of RAM per core) has an impact on performance and response times.

From a SAS Storage Engine point of view:

  • SAS Base engine data set: typically suitable for tables under 20 GB in size (some people may disagree with this!). Each data set is a single binary file, so you have to look at it from the processing (reading, updating), backup, and restore points of view.
  • SAS SPD engine data set: best suited for processing large tables and reducing response times, but it requires deeper technical skills to set up and use. Check out the SAS® Scalable Performance Data Engine documentation for additional details; there are links on when to use it, system requirements, and other useful topics (see the sketch after this list).
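A minimal sketch of an SPD Engine library that spreads data and index files over several volumes (all paths and the partition size are hypothetical):

libname bigmail spde "/meta/bigmail"          /* metadata location             */
    datapath=("/vol1/data" "/vol2/data")      /* data partitions over volumes  */
    indexpath=("/vol3/idx")                   /* indexes on their own volume   */
    partsize=16g;                             /* size of each data partition   */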

From a SAS Coding point of view:

I think you have already received some wonderful and insightful responses; there is not much I can add here.

Good luck,

Ahmed

ehsanmath
Obsidian | Level 7

Since I am very new to SAS, most of these concepts are quite new to me. I am highly thankful to all of you for sharing these wonderful ideas.

@Jaap Karman: thanks a lot for such a detailed answer. You asked a few questions: yes, I have SAS datasets (that I read from CSVs), and they are currently on a single hard drive.

Let me start working on these ideas. In case of further (concrete) questions, I will write again.

Many thanks once again

