SAS242424
Calcite | Level 5

Introduction

Class action lawsuits often require handling extremely large datasets, sometimes exceeding 100 million records with more than 50 columns. Efficiently sorting these files is critical for analysis and reporting. In this report, I discuss my experience sorting such large files using SAS on an MSI laptop equipped with an Nvidia GPU.

 

Hardware and Software Setup

  • Laptop: MSI with an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz and an Nvidia GPU
  • Memory: 16.0 GB RAM (15.8 GB usable)
  • Operating system: 64-bit, x64-based processor
  • Manufacturer: MSI (Micro-Star International) is a well-known manufacturer of high-performance laptops, particularly suited for gaming and data-intensive tasks. The model used in this case features high-speed processing and robust cooling capabilities.
  • GPU: The laptop includes an Nvidia GPU (Graphics Processing Unit). GPUs can accelerate parallel workloads in software written to use them, but PROC SORT in Base SAS runs on the CPU, so the sort times reported here depend mainly on the CPU, RAM, and disk.
  • Software: SAS (Statistical Analysis System) is a powerful software suite used for data management, advanced analytics, and statistical modeling. It is widely used in industries such as healthcare, finance, and legal analytics for handling large datasets efficiently.
  • File Type: Converted to SAS dataset format (sas7bdat) for improved performance.
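
The report does not say what format the raw file arrived in. As a minimal sketch, assuming it was a delimited CSV file, the conversion to a permanent sas7bdat dataset could look like the following. The paths, the libref claims, and the dataset name large_dataset are hypothetical.

```sas
/* Hypothetical folder for the permanent SAS dataset; it must already exist. */
libname claims 'C:\sasdata\claims';

/* Assumption: the raw data is a CSV file with a header row. */
proc import datafile='C:\rawdata\claims.csv'
            out=claims.large_dataset
            dbms=csv
            replace;
  guessingrows=10000;  /* scan more rows than the default when guessing column types */
run;
```

For a file with more than 50 columns, a DATA step with an explicit INPUT statement gives tighter control over column types and lengths than PROC IMPORT's guesses.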

 

Initial Attempt and Challenges

Initially, I attempted to sort the entire file without downsizing. However, this resulted in an immediate laptop crash. While the MSI laptop recovered, I had to manually delete some work files generated during the crash to free up space and restore functionality. More RAM would have helped.
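
Leftover work files after the crash suggest that the sort's temporary files are worth a look before retrying. A minimal sketch for checking where SAS writes them and how much memory a sort may use (the option names are standard SAS system options; the values depend on the installation):

```sas
/* Write the current values of the options that control the location of */
/* temporary files (WORK, UTILLOC) and the memory limits (MEMSIZE,       */
/* SORTSIZE) to the SAS log.                                             */
proc options option=(work utilloc memsize sortsize);
run;
```

If WORK or UTILLOC points to a drive with little free space, a sort of a 100-million-row table can fill it well before memory becomes the limit.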

 

Optimizing the Sorting Process

After experiencing the crash, I attempted to downsize the file multiple times to determine a manageable size for efficient sorting. Key findings include:

 

  1. Sorting 25 Million Records: Sorting this size was practical, completing in a reasonable time—approximately the time it takes to pour a cup of coffee.
  2. Sorting 15 Million Records: This size offered even faster sorting speeds without straining system resources.
  3. Sorting Larger Files: Any dataset significantly exceeding 25 million records risked performance issues or crashes.
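
The report does not describe how the file was downsized or whether the pieces were recombined. One possible approach, sketched here under the assumption that the full table is called work.large_dataset and is sorted by key_variable (both placeholder names), is to sort fixed-size chunks and then interleave them:

```sas
/* Log real time, CPU time, and memory for each step. */
options fullstimer;

/* Sort the first 25 million rows. OBS= is the last row number to read, */
/* not a row count.                                                      */
proc sort data=work.large_dataset(firstobs=1 obs=25000000)
          out=work.chunk1;
  by key_variable;
run;

/* Sort the next 25 million rows. */
proc sort data=work.large_dataset(firstobs=25000001 obs=50000000)
          out=work.chunk2;
  by key_variable;
run;

/* SET with BY interleaves the already sorted chunks into one sorted table. */
data work.sorted_all;
  set work.chunk1 work.chunk2;
  by key_variable;
run;
```

A 100-million-row file would need four such chunks. The interleave step still writes the full table once, so enough disk space for one complete copy is required.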

 

Advanced Sorting Techniques

Advanced sorting techniques for very large datasets were not used, but are listed next:

Use PROC SORT with TAGSORT option:

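TAGSORT stores only the BY variables and the observation numbers in temporary files, which can help when disk space for the sort is insufficient. The trade-off is that run time can increase.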
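
A minimal sketch of the option, using the same placeholder names as the SORTSIZE example in this report (large_dataset and key_variable), plus a hypothetical output name:

```sas
/* TAGSORT keeps only the BY variables and observation numbers in the   */
/* temporary files, then uses the sorted tags to retrieve the full rows. */
proc sort data=large_dataset out=sorted_dataset tagsort;
  by key_variable;
run;
```

The saving is largest when the BY variables are short relative to the full record, which is likely with 50 or more columns and a single sort key.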

 

Utilize the SORTSIZE option:

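Set SORTSIZE to a value below the physical memory available to the SAS session, so that the sort does not push the operating system into swapping. A very small value, such as 1 or 2 megabytes, pushes most of the sort into temporary utility files on disk. For example (2G is an illustrative value for a 16 GB machine, not a tested one, and it should stay below the session's MEMSIZE limit):

 

PROC SORT data=large_dataset SORTSIZE=2G;
 BY key_variable;
RUN;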

 

 

Conclusion

On this MSI laptop (16 GB RAM), sorting SAS datasets worked best when limited to approximately 15–25 million records at a time. In my experience, converting raw data into the SAS dataset format (sas7bdat) improved sorting performance. Techniques such as TAGSORT and a tuned SORTSIZE may further improve sorting of very large datasets in class action lawsuit analytics.

 

Other Reports by Melvin Ott:

3 REPLIES
ChrisHemedinger
Community Manager

Thank you for sharing your experience, Dr. Ott. For readability, I pulled the content of your PDF attachment into the body of the message so that more community members might see it. Others may comment with other sort tips and experiences.

Patrick
Opal | Level 21

@SAS242424 Thanks for sharing!

Here are my five cents:

I guess what's "right" will very much depend on your data, your environment, the requirements and the usage of your data. 

I assume with "class action data" you actually want to query your data in different ways - like once per claim type, the next time per case status and then per ... If that's true then I guess no single sort order will suffice to avoid "full table scans". 

@ChrisHemedinger was apparently too humble to cite himself, but it might be worth your while to read "Sorting data in SAS: can you skip it?"

With your data and SAS on a laptop, it might be worth considering storing the data on your C: drive in a library that uses the SPDE engine, with indexes created that match your most common WHERE clauses or BY groups.
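
A minimal sketch of that idea. The folder, the libref claims, the dataset name, and the columns claim_type and case_status (assumed to be character variables) are all hypothetical:

```sas
/* Hypothetical folder on the C: drive; it must already exist. */
libname claims spde 'C:\sasdata\claims';

/* Copy the table into the SPDE library and build simple indexes on */
/* two hypothetical filter columns.                                  */
data claims.large_dataset(index=(claim_type case_status));
  set work.large_dataset;
run;

/* A WHERE clause on an indexed column can use the index instead of */
/* reading every row.                                                */
proc freq data=claims.large_dataset;
  where claim_type = 'A';
  tables case_status;
run;
```

Building the indexes adds time and disk space up front, so this pays off mainly when the same table is queried repeatedly.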
