BookmarkSubscribeRSS Feed
mkeintz
PROC Star

@ChrisNZ wrote:

Wonderful post as always @mkeintz .

A few notes:

1. The code fails if numeric and character variables are not present

Agreed.

 

 

2. Using function cat() for the string passed to md5() can fail to identify duplicates correctly if missing character values exist.

How is that the case?  Unlike the other "cat" functions, I understand that CAT does not strip blanks - i.e. it acts like the "||" operator, so I do not expect it to create collisions prior to the MD5 function.   And that's why I didn't bother with creating arbitrary separators between the concatenated fields.

 

3. Using a long value for the data variable in the hash table is unneeded.

Agreed.  Although I hadn't considered _MD5 ($16) long.  But see comment to #4: 

 

4. Using check() rather than find() to query the hash table is faster 

Indubitably.  I had used FIND in some other code in which I double-checked for MD5 collisions - a program for the anal-retentive, or difficult bosses, but not needed here.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20

How is that the case?  Unlike the other "cat" functions, I understand that CAT does not strip blanks - i.e. it acts like the "||" operator, so I do not expect it to create collisions prior to the MD5 function.   And that's why I didn't bother with creating arbitrary separators between the concatenated fields.

True, my bad. But then you potentially have a very long string to hash. It's still better to trim and separate imho.

 

Agreed.  Although I hadn't considered _MD5 ($16) long.  But see comment to #4: 

The issue we're trying to avoid here is running out of memory (at the cost of CPU, used to derive the hash).

Having said that, SAS hash tables are quite wasteful, and I just checked that a KEY length of 16 and DATA length of 16 occupies 64 bytes per hash table item, just as long as a KEY length of 16 and DATA length of 1. So that's one of my "improvements" gone.  

Kurt_Bremser
Super User

You need space for the whole uncompressed dataset in your WORK. So you need to know more about your dataset:

  • observation count
  • observation size
  • compressed: yes/no
  • physical file size of the dataset
  • if compressed, compression rate

The latter can be determined by copying a sufficient subset (say, 1 million obs) to a compressed dataset in WORK and looking at the log.

ChrisNZ
Tourmaline | Level 20

On top of the other valid suggestions (free space, use select distinct if you don't care about order), two more suggestions:

- Maybe you don't need this step all, what comes next?

- Copy the table in SPDE format and it will be sorted on the fly.

  This is very efficient and might require less space then proc sort, I have never looked into the space requirements.

Something like

data TEST SPEEDY.TEST(compress=binary);   
  retain A1-A99 0;
  do I=1e5 to 1 by -1; output; output; end; 
run;

proc sort data=TEST out=TEST1 nodup; by _ALL_; run; * current process;

data TEST2;                                         * SPDE process;
  set SPEEDY.TEST;
  by _ALL_;
  if md5(catx('|',of _ALL_)) ne lag( md5(catx('|',of _ALL_)) );
run;

CPU usage will be much higher though.

 

 

Patrick
Opal | Level 21

@mkeintz , @SASKiwi 

Just as a side note when using functions like catx() and SHA(): They are often limited to 32KB (at least under SAS 9.4) so careful with of _all_

Ksharp
Super User
Your dataset is too big for PROC SORT , try TAGSORT option.

proc sort data=TEST out=TEST1 nodup tagsort sortsize=max ;
by _ALL_;
run;
Kurt_Bremser
Super User

TAGSORT is meant to reduce the size of the utility file by putting only the key variable(s) and the observation pointer into it. If all variables need to go into it anyway, TAGSORT has no effect.

Ksharp
Super User
1) Try PROC SQL
proc sql;
create table want as
select distinct * from have;
quit;

2)Try batch process:
https://communities.sas.com/t5/SAS-Programming/Insufficient-space-in-file-WORK-SASTMP-000000024-n-UT...
mkeintz
PROC Star

@b0guna01 

 

You have gotten a few suggestions on this topic.

 

But ... it would improve the quality and efficiency of responses if you told us whether your goal is only the removal of duplicates.

 

Do you really need a dataset ordered by _ALL_?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 4903 views
  • 11 likes
  • 11 in conversation