Class action lawsuits often require handling extremely large datasets, sometimes exceeding 100 million records with more than 50 columns. Efficiently sorting these files is critical for analysis and reporting. In this report, I discuss my experience sorting such large files using SAS on an MSI laptop equipped with an Nvidia chip.
Initially, I attempted to sort the entire file without downsizing. However, this resulted in an immediate laptop crash. While the MSI laptop recovered, I had to manually delete some work files generated during the crash to free up space and restore functionality. More RAM would have helped.
After experiencing the crash, I attempted to downsize the file multiple times to determine a manageable size for efficient sorting. Key findings include:
Advanced sorting techniques for very large datasets were not used but are listed next:
Use PROC SORT with TAGSORT option:
This can help overcome insufficient disk space issues.
Utilize the SORTSIZE option:
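Set SORTSIZE to limit the amount of memory available to PROC SORT, generally to a value somewhat below the physical memory available to the SAS session, to prevent unnecessary swapping. The 2M value below only illustrates the syntax. For example: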
/* Cap the memory PROC SORT may use for this one sort */
PROC SORT data=large_dataset SORTSIZE=2M;
    BY key_variable;
RUN;
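For completeness, a minimal sketch of the TAGSORT variant mentioned above. I did not run this on the class action file; the names work.large_dataset, work.sorted, and key_variable are placeholders.

```sas
/* Sketch only: dataset and variable names are hypothetical. */
/* TAGSORT keeps just the BY variables and the observation   */
/* numbers (the "tags") in the temporary sort files, then    */
/* uses the sorted tags to retrieve the full records.        */
proc sort data=work.large_dataset out=work.sorted tagsort;
    by key_variable;
run;
```

The trade-off is that TAGSORT reduces temporary disk usage but can increase processing time, so it is usually worth trying only when the regular sort runs out of disk space.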
For optimal performance on an MSI laptop with an Nvidia chip, sorting SAS datasets should be limited to approximately 15–25 million records at a time. Converting raw data into the SAS dataset format (sas7bdat) significantly improves sorting efficiency. Advanced sorting techniques such as TAGSORT and a tuned SORTSIZE may further enhance sorting performance for very large datasets in class action lawsuit analytics.
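One way to keep each sort within the 15–25 million record range is to sort the file in slices and then interleave the sorted slices. The following is a rough sketch under stated assumptions, not the exact steps I ran: the macro name sort_in_chunks, the dataset work.large_dataset, the variable key_variable, and the 20 million chunk size are all hypothetical, and nchunks has to be chosen so that the chunks cover the whole file.

```sas
/* Sketch only: all names and sizes below are assumptions. */
%macro sort_in_chunks(in=, key=, chunk=20000000, nchunks=5);
    %local i first last;

    /* Sort one slice of observations at a time. */
    %do i = 1 %to &nchunks;
        %let first = %eval((&i - 1) * &chunk + 1);
        %let last  = %eval(&i * &chunk);
        proc sort data=&in(firstobs=&first obs=&last) out=work.chunk&i;
            by &key;
        run;
    %end;

    /* Interleave the sorted slices: SET with a BY statement   */
    /* reads the inputs in key order, so the result is sorted. */
    data work.sorted;
        set %do i = 1 %to &nchunks; work.chunk&i %end; ;
        by &key;
    run;
%mend sort_in_chunks;

%sort_in_chunks(in=work.large_dataset, key=key_variable,
                chunk=20000000, nchunks=5);
```

The interleave step is a single sequential pass, so it avoids one large sort, but the slices and the final output both need disk space while the step runs.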
Thank you for sharing your experience, Dr. Ott. For readability, I pulled the content of your PDF attachment into the body of the message so that more community members might see it. Others may comment with other sort tips and experiences.
@SAS242424 Thanks for sharing!
Here are my five cents:
I guess what's "right" will very much depend on your data, your environment, the requirements and the usage of your data.
I assume with "class action data" you actually want to query your data in different ways - like once per claim type, the next time per case status and then per ... If that's true then I guess no single sort order will suffice to avoid "full table scans".
@ChrisHemedinger was apparently too humble to cite himself, but it might be worth your while to read "Sorting data in SAS: can you skip it?"
With your data and SAS on a laptop, it might be worth considering storing the data on your C: drive under a library with the SPDE engine, with indexes created that match your most common WHERE clauses or BY groups.
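A minimal sketch of what that could look like, assuming a folder C:\sasdata\spde exists and that the data has columns named claim_type and case_status (the path, library name, dataset names, column names, and the value 'A' are all made up for illustration):

```sas
/* Sketch only: path, libref, dataset, and column names are hypothetical. */
libname spdelib spde 'C:\sasdata\spde';

/* Copy the data into the SPDE library once. */
data spdelib.claims;
    set work.large_dataset;
run;

/* Create simple indexes on the columns used most in WHERE clauses. */
proc datasets library=spdelib nolist;
    modify claims;
    index create claim_type;
    index create case_status;
quit;

/* A WHERE clause on an indexed column may let SAS avoid a full table scan. */
proc freq data=spdelib.claims;
    where claim_type = 'A';
    tables case_status;
run;
```

Whether the index actually gets used depends on how selective the WHERE clause is, so it is worth testing with your real queries before dropping the sort step.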