
Techniques To handle Very Very Large Dataset

N/A
Posts: 0

Techniques To handle Very Very Large Dataset

Hi,

I have a SAS dataset which is about 15-16 GB (approximately 45,000,000 records).
It takes too much time to work on and analyze such a large dataset, and it sometimes fails with an insufficient-memory error.
We have taken care of the lengths of the variables, and we also use KEEP and DROP options while processing.
We need to create reports on a daily basis from this dataset. SAS Enterprise Guide takes more than an hour to run the whole project.
Are there any methods to optimize this and reduce the run time to a minimum?

Thank you,
Avi
Esteemed Advisor
Posts: 5,194

Re: Techniques To handle Very Very Large Dataset

Questions like these are almost impossible to answer directly; more information is needed.
The bottom-line question is: what do the reports look like? Are they aggregates? Are they filtered? Techniques to make your data "workable":
-Normalize, which lets you have smaller, more specialized tables
-Index, which lets you query subsets with a minimum of resources
-Build pre-calculated aggregates, perhaps combined with star-schema modelling
-If you are using the Base engine for storage, evaluate moving to SPDE
-If you are running these processes in a local Windows SAS session, see whether it's possible to upgrade the internal memory and run a 64-bit Windows version.
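To make the SPDE and indexing points concrete, here is a minimal sketch; all paths, library names, and dataset/column names below are hypothetical:

```sas
/* Hypothetical paths and names throughout. SPDE spreads a large table
   over several data paths and can evaluate WHERE clauses in parallel. */
libname big spde '/data/spde/meta'
    datapath=('/disk1/spde' '/disk2/spde')
    indexpath=('/disk3/spde')
    partsize=512m;

/* An index on a frequently filtered column lets WHERE-based subsetting
   avoid a full table scan. */
proc datasets library=big nolist;
    modify trades;
    index create order_type;
quit;
```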

Etc

Good luck
Linus
Data never sleeps
Valued Guide
Posts: 2,111

Re: Techniques To handle Very Very Large Dataset

One addition to Linus's list: use a LIBNAME statement to get to your data. That way, Enterprise Guide is involved only in the setup and the display, and the actual analyses are all handled by the SAS server.
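For example (the path and dataset name below are made up), a library assigned in a program node is processed entirely in the server session:

```sas
/* Hypothetical server path and dataset. The summary runs on the server;
   EG only submits the code and renders the output. */
libname rpt '/sasdata/daily';

proc means data=rpt.trades n mean max min;
    var response_time;
run;
```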
N/A
Posts: 0

Re: Techniques To handle Very Very Large Dataset

Hey, thanks for the reply.

Actually, my report is divided into various sub-reports.
The dataset contains 30 days of data; every day approximately 2,000,000 (20 lakh) records are appended to it, and the oldest day's records are removed.

The data is then grouped and filtered on various fields for each sub-report. For a few reports, the same field needs to be grouped to calculate summarized figures, and these summarized figures are then used to calculate further statistics.

For example, on the time field, the data is grouped and avg, max, min, and so on are calculated per 5 seconds, per session, and per order type; grouped again per hour, per session, and per order type; and the same for the whole day and for the 30-day window. The results are then merged into a single dataset.
Graphs are also plotted using the averages for all the groups in a whole day.

Similarly, the data is grouped using different combinations to calculate various statistics, such as N, P1, P10, mean, Q1, median, Q3, P90, P99, and so on.
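A hedged sketch of that kind of multi-combination summary (dataset and variable names are invented): PROC MEANS with a TYPES statement can produce several grouping combinations, plus the percentile statistics, in a single pass over the data:

```sas
/* Invented names. One pass produces statistics for three grouping
   combinations; _TYPE_ in the output identifies each one. */
proc means data=work.trades noprint;
    class session order_type;
    types session order_type session*order_type;
    var response_time;
    output out=work.stats
        n=n p1=p1 p10=p10 mean=mean q1=q1 median=median
        q3=q3 p90=p90 p99=p99;
run;
```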

Each report is printed using PROC REPORT and PROC GPLOT, in sequence, into one PDF.
It also uses a lot of ODS layout, GOPTIONS, and macro code to generate the report.

How can I find out whether the project is optimized as well as it can be?

Regards,
Avi
Respected Advisor
Posts: 3,823

Re: Techniques To handle Very Very Large Dataset

Linus has already listed all the techniques that might help you.

What I like to do with daily data loads and "rolling" data like your 30-day monster: I create one dataset per day, and then a view containing those 30 days.

This has the advantage that all reports using only one day can work with the much smaller daily dataset, and loading a new day means only creating one new SAS dataset (rather than loading and deleting records in a 30-day dataset) and re-creating the view.
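A minimal sketch of this daily-dataset-plus-view approach (library and member names are hypothetical):

```sas
/* Hypothetical names: day1 holds the oldest day, day30 the newest.
   Loading a new day means writing one small member and re-creating
   the view, not rewriting a 45-million-row table. */
data daily.day30;
    set work.new_records;
run;

data daily.last30 / view=daily.last30;
    set daily.day1 - daily.day30;   /* numbered range list of members */
run;
```

In practice the daily member names would be rotated (or dated, with the view regenerated by a small macro) as each new day arrives.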

HTH
Patrick
Esteemed Advisor
Posts: 5,194

Re: Techniques To handle Very Very Large Dataset

Since your problem seems quite complex, I don't believe this is the right place to give a thorough solution. I suggest that you contact a consultant for on-site or off-site guidance. To help you, that person would probably need the specifications of your hardware, PROC CONTENTS output for your data, sample data, sample reports, log files, etc.

/Linus
Data never sleeps
SAS Employee
Posts: 149

Re: Techniques To handle Very Very Large Dataset

A few thoughts... I'm sure you've already tried some of these.

1. Make sure to turn on the maximum performance statistics available with OPTIONS FULLSTIMER; you could set that in a program node. Also make sure the project log is turned on so you can review all the log information at once; in EG 4.1 that's under File / Properties / Project Log. Consulting the project log with FULLSTIMER set will give you an idea of which steps are taking the most resources in terms of I/O, memory, CPU time, and real time.

2. Indexes can improve the performance of subsetting and joins. Since you're rebuilding the data every day, it might not be worth the overhead to recreate the indexes, but if you do a lot of subsequent subsetting it might be worth a try.

3. Joins may give you better performance than DATA step merges. Of course, you may be doing joins already.

4. Pre-sorting intelligently can sometimes reduce or eliminate insufficient-memory messages. I'm thinking of procedures like PROC MEANS with a CLASS statement, or PROC REPORT. Of course, sorting itself takes a lot of resources.

5. If you're firing off many separate types of analysis that are grouped by the same set of columns, pre-sorting based on that set of columns can reduce EG overhead.

6. Assuming you can get the insufficient memory problems under control, what's wrong with just scheduling your monster project to run at 4 AM each day? Ultimately, all the coding enhancements in the world will only get you so far since you're processing such a large volume of data. After that, you'd really have to upgrade your hardware and/or find a better storage medium for this much data (e.g. SPD file, Teradata, etc.).

Again, many of these suggestions you may have tried and/or not apply to your project. Maybe one of these will help! What steps, specifically, are generating the insufficient memory messages? PROC SQL? PROC REPORT?
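As a hedged illustration of points 4 and 5 (all names below are invented): with the data pre-sorted, BY-group processing holds only one group in memory at a time, unlike CLASS, which keeps every combination in memory at once:

```sas
/* Invented names. TAGSORT trades run time for a smaller sort workspace. */
proc sort data=big.trades out=work.sorted tagsort;
    by order_type session;
run;

/* BY-group processing streams one group at a time through memory. */
proc means data=work.sorted noprint;
    by order_type session;
    var response_time;
    output out=work.stats mean=avg max=max min=min;
run;
```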
N/A
Posts: 0

Re: Techniques To handle Very Very Large Dataset

Yes Richard, I have checked every step using FULLSTIMER.
From my experience I have found that using PROC MEANS with a CLASS statement, or PROC FREQ, is more efficient for summarizing than PROC SQL with GROUP BY.
But sometimes PROC SQL is unavoidable, for example when the WHERE clause is on a calculated field (like the sum of two other fields).
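That case can be sketched like this (names are invented); PROC SQL's CALCULATED keyword lets the WHERE clause reference the derived column directly:

```sas
/* Invented names: filter on the sum of two columns in one step. */
proc sql;
    create table work.big_orders as
    select order_id, qty_a + qty_b as total_qty
    from work.trades
    where calculated total_qty > 1000;
quit;
```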

I did not put any indexes on it, as every column has many repeating values; those columns are used to group the other columns, which are the facts.

I will try replacing a few merges with joins and use whichever yields the best results.
Another drawback of MERGE is that the data needs to be sorted first, so we may save some resources there.
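For reference, a sketch of the join alternative (names invented); unlike MERGE, an SQL join does not require either input to be pre-sorted:

```sas
/* Invented names: left join without any prior PROC SORT. */
proc sql;
    create table work.combined as
    select a.*, b.session_desc
    from work.trades as a
        left join work.sessions as b
        on a.session = b.session;
quit;
```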

Yes, my project is scheduled to run every day at 11 PM, but I have to run it manually the next day if an error has occurred.
The insufficient-memory problems occur mostly in PROC SORT. I have used TAGSORT to reduce the chances of them, but I have to pay for that with time (give and take).

Currently my big dataset is stored as a SAS dataset in a permanent library, since loading it is faster than storing it in a SQL database, which is the only alternative available.

I have also received an access violation error (with a memory address) from PROC REPORT, but only once or twice over the last three months; I was not able to find its cause.

Thanks for your suggestions.

Regards,
Avi