@Quentin wrote:
I like the idea of coding up the same algorithm using multiple approaches to compare efficiency.
That said, I think you should work to get all the approaches to match in their output.
This should not happen. If SAS runs out of memory, you should get an error in the log. You definitely should not get the wrong result (with no error). If you really have a case where SQL is giving you the wrong result, I would send it in to tech support. The same goes for your statement that the hash approach had some discrepancies. I think it's likely that there are some edge cases in your data that are falling through cracks in your code. But if you have a repeatable example of a discrepancy (especially one where the results of the code vary with the amount of memory available to the SAS session), please send it in to tech support.
Also confused by your statement that the output dataset from the hash approach was double the size of other approaches. If the output datasets from each approach are identical (e.g. judged via PROC COMPARE to compare the metadata and data), this shouldn't happen. Unless maybe you changed compression options.
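A minimal sketch of that PROC COMPARE check, in case it helps. The dataset names (work.flags_sql, work.flags_hash) and the key variables (case_id, flag) are placeholders, not names from the actual project, and the sketch assumes each output has one row per case and flag.

```sas
/* Hypothetical names: work.flags_sql, work.flags_hash, case_id, flag. */
/* Sort both outputs by the same keys so the ID statement can pair rows. */
proc sort data=work.flags_sql out=work.sql_sorted;
  by case_id flag;
run;

proc sort data=work.flags_hash out=work.hash_sorted;
  by case_id flag;
run;

/* LISTALL reports variables and observations found in only one dataset. */
proc compare base=work.sql_sorted compare=work.hash_sorted listall;
  id case_id flag;
run;
```

If either dataset holds more than one row for the same case_id and flag, PROC COMPARE should report the duplicate ID values in the log, which would itself be a useful clue.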
Before this project, I had virtually no experience with arrays or hash objects. I don't know why the flag counts were slightly different. The only clue I had that RAM could be an issue was that Firefox crashed and presented a dialog box indicating that the crash was due to insufficient memory. When I manually reviewed the discrepant cases, the codes that should have matched for those cases had been successfully matched on many other cases. And after increasing the memory, the discrepancies disappeared.
It wasn't just the output file size that differed among the various approaches; the number of records they contained differed too. So data compression wouldn't explain the differences. Perhaps Proc SQL does not continue to search for matches on a given flag after it has already matched that flag, whereas the hash version DOES continue to search? In that case, there could be duplicate records for a given case that match on the same flag, and that could explain the differences in file size. I could run Proc SQL with select distinct on the hash output file to see if it finds and eliminates any duplicate records.
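A rough sketch of that duplicate check, assuming the hash output is a dataset called work.flags_hash with one row per matched case and flag. The names work.flags_hash, case_id, and flag are placeholders for whatever the real file uses.

```sas
/* Hypothetical names: work.flags_hash, case_id, flag. */
proc sql;
  /* Case/flag pairs that appear more than once in the hash output. */
  create table work.dup_pairs as
  select case_id, flag, count(*) as n_rows
  from work.flags_hash
  group by case_id, flag
  having count(*) > 1;

  /* De-duplicated copy, for comparing row counts with the SQL output. */
  create table work.hash_distinct as
  select distinct *
  from work.flags_hash;
quit;
```

One caveat: select distinct * only removes rows that are identical in every column. If the output also keeps the code that triggered the match, two rows for the same case and flag with different codes would both survive, so the group by query is the more direct test of the duplicate-flag idea.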
I will review all of this to make sure I haven't made any errors. If I can't find any, I'll submit it to tech support. However, I'm getting busy with other projects, so I don't have as much time to dedicate to this right now. It may take a while 😐