Thanks again for the perspectives, Jaap and Tom. We discussed some of the performance issues around profiling large data sets internally, and I can share details about general improvements within the Data Management Platform that help from a memory-management standpoint, plus information about a setting you can use when profiling large data sets (those with many columns).

Starting with DMP release 2.4, the performance of frequency distribution calculations on large data sets has improved significantly over previous releases. ("Large" here means the input data exceeds the memory allocated for frequency distribution several times over, resulting in multiple memory dump files.) In some test cases, performance improved by more than an order of magnitude.

In DMP release 2.3 and earlier, the prof/per_table_bytes option was introduced in app.cfg to configure the amount of memory used per profiled column. It typically needs to be set only when profiling hundreds of columns.

Starting with DMP release 2.4, the profile engine uses memory differently. The same profiling job may use more overall system memory than it did in the past; however, you still have controls over how much memory is used per profiled column. When using Profile, app.cfg still supports the "prof/per_table_bytes" option. When using the Frequency Distribution data node, the new HASH_BUCKETS property is now supported, and the old HASH_TABLE_SIZE property is still recognized when an old job is loaded and run. If the option is not set (in either form), the profile engine uses the default of 1024 x 1024 = 1048576 buckets (4 MB or 8 MB per table/column profiled).
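For a rough sense of what that default means in practice, here is a minimal back-of-the-envelope sketch. It assumes each hash bucket occupies 4 or 8 bytes, which is an inference from the 4 MB / 8 MB figures above rather than documented engine internals:

```python
# Default number of hash buckets the profile engine uses when neither
# prof/per_table_bytes nor HASH_BUCKETS is set (per the post above).
DEFAULT_BUCKETS = 1024 * 1024  # 1,048,576 buckets


def per_column_memory_mib(bucket_bytes: int) -> float:
    """Approximate memory per profiled table/column, in MiB,
    assuming each bucket occupies `bucket_bytes` bytes (assumed value)."""
    return DEFAULT_BUCKETS * bucket_bytes / (1024 * 1024)


# 4-byte buckets -> 4 MiB per column; 8-byte buckets -> 8 MiB per column,
# matching the 4 MB / 8 MB figures quoted above.
print(per_column_memory_mib(4))  # 4.0
print(per_column_memory_mib(8))  # 8.0
```

So when profiling hundreds of columns at the default setting, the frequency-distribution memory alone can reach into the gigabytes, which is why prof/per_table_bytes (or HASH_BUCKETS) matters for wide tables.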
You might also be interested to know that SAS is developing a new profiling engine that runs "in-database," meaning the work involved in generating a profile report will be able to run within the data source itself, rather than relying on extracting the data to a Data Management Server or Data Quality Server where the calculations are done. Leveraging the typically greater hardware resources of the data source can significantly improve profiling performance for very large tables or data sets. Any other suggestions or comments about profiling large data sets? Keep them coming!