If you are dealing with very large character fields, you can save space by taking the MD5 hash of the values before they go into the hash step. Here is an example using randomly generated data with 100,000 records. FULLSTIMER again reported total memory usage of just about 100 MB max, and at the OS level the maximum virtual size of the process was between 450 and 500 MB; the test file is 10 MB on disk. The operation took about 36 seconds across multiple runs.

To test against larger strings, I used the PUT function on the MD5 output with the $BINARY128. format, which is the same as having every character field in the OP's original data be length $128. Memory usage was slightly higher, about 160 MB on average, with a VIRT of about 520 MB; the test file is now 47 MB on disk, and run time was about 41 seconds across multiple runs.

In short, I have seen increased performance by using the CALL SORTC routine for data prep and the MD5 hash to shrink large character fields. The only issue is that both methods obscure your data, so it would be good to add a finder key to link the generated household back to the original data. I have spent enough time on this for now, but I will come back to it later.

Test data sent to KSharp's process:

data test;
   call streaminit(12345);
   array a[4] $200 pan1-pan3 add1;
   do i = 1 to 10**5;
      do j = 1 to dim(a);
         /* numeric key is auto-converted to character before hashing;      */
         /* uncomment the PUT wrapper to expand each value to $128          */
         a[j] = /*put(*/ md5(abs(mod(int(rand('cauchy')*10**5), 10**6))) /*, $binary128.)*/;
      end;
      call sortc(of a[*]);   /* sort the values within each record before output */
      output;
   end;
   drop i j;
run;

sas_Forum: Can you please share your memory and performance options settings, as I asked before? Also, how much free memory is available on your computer, and which version of Windows are you running specifically? As art297 inquired, where are you sourcing the data that goes into the hash step: are you pulling it across a network, from local disk, from tape, etc.?
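Adding, for illustration, a minimal sketch of the MD5-plus-finder-key idea described above. The dataset name HAVE, the record-id variable REC_ID, and the digest variables H1-H4 are hypothetical names for this sketch, not part of anything posted earlier: replace each long character field with its 16-byte MD5 digest and carry a row number so the household assignments can be linked back to the original records.

data have_hashed;
   set have;                          /* hypothetical source data with the long character fields   */
   rec_id = _n_;                      /* finder key back to the original row                       */
   array longvals[4] pan1-pan3 add1;  /* existing long character fields                            */
   array digests[4] $16 h1-h4;        /* 16-byte MD5 digests replace the long values               */
   do _i = 1 to dim(longvals);
      digests[_i] = md5(strip(longvals[_i]));
   end;
   call sortc(of digests[*]);         /* sort digests within the row so field order does not matter */
   keep rec_id h1-h4;
run;

The grouping/household logic can then run against HAVE_HASHED, and the resulting household id can be merged back to the full records by REC_ID.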