Introduction
Class action lawsuits often require handling extremely large datasets, sometimes exceeding 100 million records with more than 50 columns. Sorting these files efficiently is critical for analysis and reporting. In this report, I discuss my experience sorting such large files using SAS on an MSI laptop equipped with an Nvidia GPU.
Hardware and Software Setup
Laptop: MSI with Nvidia GPU
CPU: Intel(R) Core(TM) i7-10750H @ 2.60 GHz
RAM: 16.0 GB (15.8 GB usable)
System type: 64-bit operating system, x64-based processor
MSI (Micro-Star International) is a well-known manufacturer of high-performance laptops, particularly suited for gaming and data-intensive tasks. The model used in this case features high-speed processing and robust cooling capabilities.
GPU: The laptop includes an Nvidia GPU (Graphics Processing Unit), which is designed for graphics rendering and other highly parallel workloads. A GPU speeds up data processing only when the software is written to use it. PROC SORT in Base SAS runs on the CPU and depends mainly on memory and disk, so the GPU is unlikely to have contributed to the sort times reported here.
Software: SAS (Statistical Analysis System) is a powerful software suite used for data management, advanced analytics, and statistical modeling. It is widely used in industries such as healthcare, finance, and legal analytics for handling large datasets efficiently.
File Type: Converted to SAS dataset format (sas7bdat) for improved performance.
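The conversion step is not shown in this report. As a minimal sketch, assuming the raw extract is a CSV file, it could look like the following. The library name big, the paths, and the file name claims.csv are hypothetical placeholders, not the names used in the actual project.

```sas
/* Assumption: the raw extract is a CSV file; all paths and names are hypothetical. */
libname big "C:\classaction\data";   /* permanent library that will hold the .sas7bdat file */

proc import datafile="C:\classaction\raw\claims.csv"
            out=big.large_dataset
            dbms=csv
            replace;
    guessingrows=10000;   /* number of rows scanned to infer column types and lengths */
run;
```

For a file with more than 50 columns, a DATA step with an explicit INPUT statement gives tighter control over column types and lengths than PROC IMPORT, at the cost of listing every column by hand.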
Initial Attempt and Challenges
Initially, I attempted to sort the entire file without downsizing it. The laptop crashed almost immediately. It recovered, but I had to manually delete work files left behind by the crashed session to free up disk space and restore functionality. More RAM would probably have helped.
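Before retrying a large sort, it can help to check where SAS writes its temporary files and what memory limits are in effect. The sketch below was not part of the original test; it only uses standard system option names and the WORK library.

```sas
/* Print the current settings: WORK location, utility file location,
   sort memory limit, and total memory limit for the session. */
proc options option=(work utilloc sortsize memsize);
run;

/* Delete every member of the current session's WORK library. */
proc datasets library=work kill nolist;
quit;
```

Note that PROC DATASETS only clears the WORK library of the running session. Folders orphaned by a crashed session still have to be removed from the file system, as described above.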
Optimizing the Sorting Process
After experiencing the crash, I attempted to downsize the file multiple times to determine a manageable size for efficient sorting. Key findings include:
Sorting 25 Million Records: Sorting this size was practical, completing in roughly the time it takes to pour a cup of coffee.
Sorting 15 Million Records: This size offered even faster sorting speeds without straining system resources.
Sorting Larger Files: Any dataset significantly exceeding 25 million records risked performance issues or crashes.
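The report does not say how the file was downsized. One possible approach, shown here as a sketch under assumptions, is to sort fixed ranges of observations with the FIRSTOBS= and OBS= data set options and then interleave the sorted chunks. The names big, large_dataset, key_variable, and the path are placeholders; only two chunks are shown.

```sas
/* Assumptions: library path, dataset name, and BY variable are hypothetical;
   the 25,000,000-row chunk size comes from the findings above. */
libname big "C:\classaction\data";

/* Chunk 1: observations 1 through 25,000,000 */
proc sort data=big.large_dataset(firstobs=1 obs=25000000)
          out=work.chunk1;
    by key_variable;
run;

/* Chunk 2: observations 25,000,001 through 50,000,000 */
proc sort data=big.large_dataset(firstobs=25000001 obs=50000000)
          out=work.chunk2;
    by key_variable;
run;

/* Interleave the sorted chunks: SET with BY reads the inputs in key order,
   so the result is sorted across all chunks. */
data big.sorted_all;
    set work.chunk1 work.chunk2;
    by key_variable;
run;
```

OBS= gives the number of the last observation to read, not a row count, which is why the second chunk ends at 50000000. The interleave step reads each chunk sequentially, so it should need far less memory than sorting the full file in one pass.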
Advanced Sorting Techniques
The following advanced sorting techniques for very large datasets were not used in this test but are worth listing:
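Use PROC SORT with the TAGSORT option:
With TAGSORT, the temporary files hold only the BY variables and the observation numbers rather than full records. This can help when there is not enough disk space for the sort, although processing time may be considerably longer.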
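As a minimal sketch of the TAGSORT option (not run as part of this test; the dataset and variable names are placeholders):

```sas
/* TAGSORT: only the BY variables and observation numbers go into the temporary files. */
proc sort data=large_dataset out=sorted_dataset tagsort;
    by key_variable;
run;
```

The saving is largest when the BY variables are short compared with the full record, which is typical for a file with more than 50 columns sorted on one key.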
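Utilize the SORTSIZE option:
SORTSIZE= caps the amount of memory PROC SORT may use. Set it below the physical memory actually available to the SAS session so that the operating system does not start swapping. Data that does not fit within the limit is written to a temporary utility file. The 2G value below is an illustrative figure for a 16 GB laptop, not a tested setting. For example:
/* SORTSIZE=2G is an assumed, untested value; adjust it to the memory available */
PROC SORT data=large_dataset SORTSIZE=2G;
    BY key_variable;
RUN;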
Conclusion
On this MSI laptop with 16 GB of RAM, sorting SAS datasets worked best when limited to approximately 15 to 25 million records at a time. Converting raw data into the SAS dataset format (sas7bdat) significantly improved sorting efficiency. Techniques such as TAGSORT and a tuned SORTSIZE were not tested here, but they may further improve sorting performance for very large datasets in class action lawsuit analytics.
Other Reports by Melvin Ott:
Leveraging SASPy for Efficient Analytics in Class Action
Sort Large Files Faster with SASPy