05-06-2014 04:22 PM
A recent topic that came up here at SAS was about the best ways to profile large text files. I wanted to share it to get your perspective, similar to what Anna-Marie posted on commit intervals. After a week or so, I’ll post ideas shared by SAS employees.
“What is the best way to profile large text files? For example, say you want to profile a CSV file with 100 million rows and over 100 columns. What sort of performance and architecture considerations might come into play? What are some strategies or best practices for making this work as efficiently as possible?”
05-06-2014 04:35 PM
You say text files, but then mention 100 columns. Is that a data file with numbers in a text format?
Or is it a text document like a book or article?
05-06-2014 04:49 PM
The example was meant to be numeric or character data represented as comma-separated values in a simple text file. For example:
Joe Blogs,123 Main Street,Anytown,…
Joe Blow,321 Front Street,Specifictown,…
05-06-2014 07:23 PM
Interesting problem. It's similar to one I've faced a few times. (All examples using SAS, unless otherwise specified.)
Question 1: Is the file delimited into rows? (I've dealt with cases where it's just a stream of characters). To assess this, I'll run through it, checking the length of each _INFILE_ result. Analysing the results should give me an idea of whether I at least have records to work with.
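In SAS that scan would read the raw file and check the length of the _INFILE_ buffer for each record. As a language-neutral illustration (the function name and approach here are our own sketch, not the poster's code), the same length survey might look like this in Python:

```python
from collections import Counter

def record_length_profile(path, limit=None):
    """Scan a raw text file and tally the length of each line.

    A tight cluster of lengths suggests fixed-width records; a wide
    spread suggests delimited (or messy) data; one enormous "line"
    suggests the file is really a single stream of characters.
    """
    lengths = Counter()
    with open(path, "rb") as fh:  # read bytes: don't assume an encoding yet
        for i, line in enumerate(fh):
            lengths[len(line.rstrip(b"\r\n"))] += 1
            if limit is not None and i + 1 >= limit:
                break
    return lengths
```

Looking at the resulting distribution tells you quickly whether you have records to work with at all.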
Question 2: Is the file well enough delimited to pull into something like SAS? To resolve this, I'll do things like count the number of (assumed) delimiters, in this case commas, in each record. Checking the results, and then troubleshooting the exceptions, will hopefully let me separate the records into fields (using options like DSD).
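A quick sketch of that delimiter count, outside SAS (names are illustrative, not from the thread), could look like the following. Note it deliberately ignores quoting, which is exactly why the exceptions it flags are worth troubleshooting with options like DSD:

```python
from collections import Counter

def delimiter_profile(path, delim=",", limit=None):
    """Count the assumed delimiter in each record and tally the results.

    If nearly every record has the same delimiter count, the file is
    probably cleanly delimited; the exceptions are the records to
    troubleshoot (embedded delimiters inside quotes, truncated lines).
    """
    counts = Counter()
    with open(path, "r", errors="replace") as fh:
        for i, line in enumerate(fh):
            counts[line.count(delim)] += 1
            if limit is not None and i + 1 >= limit:
                break
    return counts
```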
Assuming I can parse the data into fields, we're in pretty good shape. At this point, we're still assuming everything is text. Figuring out the max size of each field is next; if we have hundreds of fields, macro techniques are a lifesaver. Hopefully, the lengths of the strings are such that using the SAS fixed string length won't kill us. If we have cases where almost all of the values for a field are short but a few are very long, they'll need to be treated as outliers.
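In SAS the field-width pass is usually driven by macro loops over the field list; as a rough cross-language sketch (ours, not SAS macro code), one pass over a parsed CSV tracking the longest value per column might be:

```python
import csv

def max_field_widths(path):
    """One pass over a CSV, tracking the longest value seen per column.

    The csv module handles quoted fields, so embedded commas don't
    inflate the widths. Ragged rows simply grow the width list.
    """
    widths = []
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) > len(widths):
                widths.extend([0] * (len(row) - len(widths)))
            for i, value in enumerate(row):
                widths[i] = max(widths[i], len(value))
    return widths
```

Columns where the maximum is far above the typical width are the "few very long values" cases that need outlier treatment.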
So what's in our fields? Up till now, I've been pretending that I haven't looked at the data, but of course I have. Even if I haven't, the technique described in https://communities.sas.com/message/145163#145163 can be used to determine if all of the values in a field can be converted to numbers. It's worth checking the ratios; if the vast majority can, and a few can't, they may be outliers (e.g. error messages), and should be converted to MISSING.
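The linked SAS technique does the numeric check with informats; as a language-neutral sketch of the same ratio idea (our illustration, not the code in the link):

```python
def numeric_ratio(values):
    """Fraction of non-empty values in a field that parse as numbers.

    A ratio near 1.0 suggests a numeric field whose few failures are
    outliers (e.g. embedded error messages) best converted to missing;
    a ratio near 0.0 suggests a genuinely character field.
    """
    ok = bad = 0
    for v in values:
        v = v.strip()
        if not v:
            continue  # blanks count as missing, not as conversion failures
        try:
            float(v)
            ok += 1
        except ValueError:
            bad += 1
    total = ok + bad
    return ok / total if total else 0.0
```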
At this point, we have a SAS dataset containing a mix of character and numeric fields. Enterprise Guide has a nice "Characterize Data" task, which is probably as reasonable an approach as any. From what I've seen, Visual Analytics would eat data like this for lunch.
Do we want to discuss things like looking for correlations between variables? Again, VA will probably show us dozens of (probably invalid) correlations.
I assume that a key element of your question is that this file is LARGE! The first couple of steps will easily chew through anything into the hundreds of millions of records just on my PC. Beyond that, things may get too slow, and that's where strategies for speeding things up come in.
Other data formats to consider
Pure text (scanned books?), HTML off the web, and program code all present interesting challenges.
05-07-2014 03:24 AM
I went back and read the commit-interval post and am now evaluating this one. I get the feeling we are back to some fundamentals of the technical IT area.
As all processing is done on some hardware, understanding the impact of your actions on that hardware can be very important. The oversimplified view that hardware doesn't matter, as if it were unlimited in speed and free of cost, is incorrect. A response time of 3 seconds versus 3 weeks makes a big difference to the experience and to practical usability.
With the "big data" hype, the hardware layout is changing. I remember mainframes of the 1980s and PCs of 1982, and then I look at current hardware like a rack-mounted 2U Dell R720. Modern machines like this, with 64 cores, 1 TB of memory, 3 TB (or much more) of hard disk, and several 10 Gb/s network adapters, are commercial options today. That makes a 100M-row CSV flat file a small one.
The problem is how to process that. In the old days, people thought in terms of a single processor. But the growth in processor speed has dropped off and been roughly flat since about 2003 (Herb Sutter, citing Intel figures), so you must go for parallel processing, grid-like or multithreaded. Hadoop is a grid-like approach with some additional properties.
Back to the commit question.
IO is the most time-consuming part of a process. If you can move the work into memory, your response time can drop.
Avoiding IO is done by minimizing commits. Touching a whole dataset sequentially is often faster than some random-access approach.
Organizing the data differently on the DASD (I/O) also matters. This is why an OLTP database (Oracle, DB2) is laid out differently from an analytics database (Vertica, Teradata).
Same to the big flat file text processing.
- If you can divide the processing into different parts (10 files of 1M records each), you can spread it out, causing a higher load but gaining response time.
- If your process is still long-running, think about checkpoint/restart. A restart that loses an hour can be much better than one that loses weeks.
- Where possible, do the processing using as much memory (CPU) as is useful, instead of doing it through IO (DASD).
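As a language-neutral sketch of the first point above (the function names and chunking scheme are our own illustration, not anything specified in the thread), splitting a big file into pre-made chunks and fanning them out over a worker pool looks roughly like:

```python
from concurrent.futures import ThreadPoolExecutor

def count_rows(chunk_path):
    """Stand-in for the real per-chunk work (parsing, profiling, ...)."""
    with open(chunk_path, "rb") as fh:
        return sum(1 for _ in fh)

def process_chunks(chunk_paths, workers=4):
    """Run the per-chunk work across a pool and combine partial results.

    This trades a higher system load for a shorter response time,
    exactly the balance described above. Threads suffice for I/O-bound
    work; CPU-bound profiling would want a process pool (or a grid).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_rows, chunk_paths))
```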
This approach requires accepting, and managing, heavy loads on the system. A load balancer (LSF) and/or workload manager (WLM, cgroups) is needed.
Efficiency is about seeking a balance between costs/investments and benefits.
05-08-2014 11:46 AM
I love the thoughtful commentary and suggestions, guys! Nice points on the 'big data' hype, Jaap. Parallel processing and the like seem to be the way forward for sure.
Anyone else have ideas on ways to approach this?
05-19-2014 10:31 AM
Thanks again for the perspectives, Jaap and Tom. Internally, some performance issues around profiling large data sets were discussed, including general improvements within the Data Management Platform that help from a memory-management standpoint, and a setting that can be used to configure for large data sets (those with many columns):
Starting with DMP release 2.4, performance of frequency distribution calculations on large data sets has been significantly improved over previous releases ("large" here meaning the input data exceeds the memory allocated for frequency distribution several times over, resulting in multiple memory dump files). In some test cases, performance improved by over an order of magnitude.
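The notes don't say how the engine handles tables that outgrow memory, but a classic external-counting technique (our illustration only, not DMP's actual implementation) is to hash-partition values into spill files so that only one partition's frequency table is ever resident at a time:

```python
import os
from collections import Counter

def external_frequency(values, workdir, partitions=8):
    """Frequency distribution of a stream too big for one in-memory table.

    Phase 1: hash-partition values into spill files on disk.
    Phase 2: count each partition separately; equal values always land
    in the same partition, so the partial counts never need merging
    across files, and only one partition's table is in memory at once.
    """
    files = [open(os.path.join(workdir, f"part{i}.txt"), "w")
             for i in range(partitions)]
    for v in values:
        files[hash(v) % partitions].write(v + "\n")
    for f in files:
        f.close()
    freq = Counter()
    for i in range(partitions):
        with open(os.path.join(workdir, f"part{i}.txt")) as f:
            freq.update(line.rstrip("\n") for line in f)
    return freq
```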
In DMP release 2.3 and earlier, the prof/per_table_bytes option in app.cfg configured the amount of memory to be used per column profiled. It typically needs to be set only when profiling hundreds of columns.
Starting with DMP release 2.4, a change was made to how the profile engine uses memory. This can lead to the same profiling job using more overall system memory than it did in the past; however, controls are still provided for how much memory to use per column profiled.
When using Profile, app.cfg still supports the prof/per_table_bytes option. When using the Frequency Distribution data node, the new HASH_BUCKETS property is supported; the old one, HASH_TABLE_SIZE, is still recognized and supported when an old job is loaded and run.
If the option is not set (under either name), the profile engine uses the default value of 1024 × 1024 = 1,048,576 buckets (4 MB or 8 MB per table/column profiled).
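The 4 MB / 8 MB figures are consistent with one pointer-sized slot per bucket (4 bytes on a 32-bit build, 8 bytes on 64-bit); that interpretation is our assumption, not something stated in the release notes:

```python
buckets = 1024 * 1024           # default bucket count
assert buckets == 1048576
# Assuming one pointer-sized slot per bucket (our assumption):
mb_32bit = buckets * 4 / 2**20  # memory per table/column, 32-bit build
mb_64bit = buckets * 8 / 2**20  # memory per table/column, 64-bit build
```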
You also might be interested in the fact that SAS is developing a new profiling engine that will perform "in-database", meaning that the work involved to generate a profile report will be capable of running within a data source, rather than relying on data extraction to a Data Management Server or Data Quality Server where the calculations are done. Leveraging the typically greater hardware resources of the data source can significantly improve profiling performance for very large tables or data sets.
Any other suggestions or comments about profiling large data sets? Keep them coming!