@ChrisNZ wrote:
Wonderful post as always @mkeintz.
A few notes:
1. The code fails if numeric and character variables are not present
Agreed.
2. Using function cat() for the string passed to md5() can fail to identify duplicates correctly if missing character values exist.
How is that the case? Unlike the other "cat" functions, I understand that CAT does not strip blanks - i.e. it acts like the "||" operator, so I do not expect it to create collisions prior to the MD5 function. And that's why I didn't bother with creating arbitrary separators between the concatenated fields.
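To illustrate the point about blanks (a minimal sketch; the variable names and lengths are made up): with fixed-length character variables, CAT preserves the trailing blanks, so two different splits of the same letters cannot concatenate to the same string, whereas CATS trims and can collide without a separator.

```sas
data _null_;
   length a b $3 kept trimmed $10;
   /* case 1 */
   a = 'AB'; b = 'C';
   kept    = cat(a, b);   /* 'AB C ' - trailing blanks preserved */
   trimmed = cats(a, b);  /* 'ABC'   - blanks stripped           */
   put kept= trimmed=;
   /* case 2: different values, same CATS result */
   a = 'A'; b = 'BC';
   kept    = cat(a, b);   /* 'A  BC' - still distinct from case 1 */
   trimmed = cats(a, b);  /* 'ABC'   - collides with case 1       */
   put kept= trimmed=;
run;
```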
3. Using a long value for the data variable in the hash table is unneeded.
Agreed. Although I hadn't considered _MD5 ($16) long. But see comment to #4:
4. Using check() rather than find() to query the hash table is faster
Indubitably. I had used FIND in some other code in which I double-checked for MD5 collisions - a program for the anal-retentive, or difficult bosses, but not needed here.
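For reference, a minimal sketch of the CHECK()-style lookup being discussed (HAVE, X, and Y are hypothetical names; CHECK() returns 0 when the key is already in the table, so a nonzero return means the row is new):

```sas
data nodup;
   length _md5 $16;
   if _n_ = 1 then do;
      declare hash h();
      h.definekey('_md5');
      h.definedone();
   end;
   set have;
   _md5 = md5(catx('|', x, y));
   if h.check() ne 0 then do;  /* key not seen yet */
      rc = h.add();
      output;
   end;
   drop _md5 rc;
run;
```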
@mkeintz wrote: "Unlike the other "cat" functions, I understand that CAT does not strip blanks ... so I do not expect it to create collisions prior to the MD5 function."
True, my bad. But then you potentially have a very long string to hash. It's still better to trim and separate imho.
@mkeintz wrote: "Agreed. Although I hadn't considered _MD5 ($16) long."
The issue we're trying to avoid here is running out of memory (at the cost of CPU, used to derive the hash).
Having said that, SAS hash tables are quite wasteful, and I just checked that a KEY length of 16 with a DATA length of 16 occupies 64 bytes per hash table item, just as a KEY length of 16 with a DATA length of 1 does. So that's one of my "improvements" gone.
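The per-item footprint can be checked directly with the hash object's ITEM_SIZE attribute (a sketch; the exact number depends on the SAS release and platform, so no figure is asserted here):

```sas
data _null_;
   length k d $16;
   declare hash h();
   h.definekey('k');
   h.definedata('d');
   h.definedone();
   bytes = h.item_size;            /* bytes of memory per hash item */
   put 'Bytes per item: ' bytes;
   call missing(k, d);             /* avoid uninitialized notes     */
run;
```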
You need space for the whole uncompressed dataset in your WORK library, so you need to know more about your dataset: how large it is uncompressed, and how well it compresses. The latter can be determined by copying a sufficient subset (say, 1 million obs) to a compressed dataset in WORK and looking at the log.
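For example (BIG is a hypothetical source table), copy a sample with the COMPRESS= option and read the size reduction that the log note reports:

```sas
/* sketch: estimate the compression ratio from a 1-million-row sample */
data work.sample(compress=binary);
   set big(obs=1000000);
run;
/* the log then reports how much compression decreased the size */
```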
On top of the other valid suggestions (free space, use select distinct if you don't care about order), two more suggestions:
- Maybe you don't need this step at all; what comes next?
- Copy the table in SPDE format and it will be sorted on the fly.
This is very efficient and might require less space than PROC SORT, though I have never looked into the space requirements.
Something like this, where SPEEDY is a libname assigned to the SPDE engine:
data TEST SPEEDY.TEST(compress=binary);
retain A1-A99 0;
do I=1e5 to 1 by -1; output; output; end;
run;
proc sort data=TEST out=TEST1 nodup; by _ALL_; run; * current process;
data TEST2; * SPDE process;
set SPEEDY.TEST;
by _ALL_;
if md5(catx('|',of _ALL_)) ne lag( md5(catx('|',of _ALL_)) );
run;
CPU usage will be much higher though.
TAGSORT is meant to reduce the size of the utility file by putting only the key variable(s) and the observation pointer into it. If all variables need to go into it anyway, TAGSORT has no effect.
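In other words (BIG and ID are hypothetical names), TAGSORT pays off only when the BY variables are a small fraction of the record:

```sas
proc sort data=big out=big_sorted tagsort;
   by id;   /* utility file holds only ID plus an observation pointer */
run;
```

With BY _ALL_, every variable is a key, so the "tag" is the whole record and nothing is saved.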
You have gotten a few suggestions on this topic.
But ... it would improve the quality and efficiency of responses if you told us whether your goal is only the removal of duplicates.
Do you really need a dataset ordered by _ALL_?