@ChrisNZ wrote:
Wonderful post as always @mkeintz.
A few notes:
1. The code fails if numeric and character variables are not present
Agreed.
2. Using function cat() for the string passed to md5() can fail to identify duplicates correctly if missing character values exist.
How is that the case? Unlike the other "cat" functions, I understand that CAT does not strip blanks - i.e. it acts like the "||" operator, so I do not expect it to create collisions prior to the MD5 function. And that's why I didn't bother with creating arbitrary separators between the concatenated fields.
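To illustrate the point about blanks (a minimal sketch; the variable names and lengths are made up): with fixed-length character variables, CAT preserves the trailing blanks, so two different splits of the same letters cannot concatenate to the same string, whereas CATS trims and can collide without a separator.

```sas
data _null_;
   length a b $3 kept trimmed $10;
   /* case 1 */
   a = 'AB'; b = 'C';
   kept    = cat(a, b);   /* 'AB C ' - trailing blanks preserved */
   trimmed = cats(a, b);  /* 'ABC'   - blanks stripped           */
   put kept= trimmed=;
   /* case 2: different values, same CATS result */
   a = 'A'; b = 'BC';
   kept    = cat(a, b);   /* 'A  BC' - still distinct from case 1 */
   trimmed = cats(a, b);  /* 'ABC'   - collides with case 1       */
   put kept= trimmed=;
run;
```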
3. Using a long value for the data variable in the hash table is unneeded.
Agreed. Although I hadn't considered _MD5 ($16) long. But see comment to #4:
4. Using check() rather than find() to query the hash table is faster
Indubitably. I had used FIND in some other code in which I double-checked for MD5 collisions - a program for the anal-retentive, or difficult bosses, but not needed here.
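For reference, a minimal sketch of the CHECK()-style lookup being discussed (HAVE, X, and Y are hypothetical names; CHECK() returns 0 when the key is already in the table, so a nonzero return means the row is new):

```sas
data nodup;
   length _md5 $16;
   if _n_ = 1 then do;
      declare hash h();
      h.definekey('_md5');
      h.definedone();
   end;
   set have;
   _md5 = md5(catx('|', x, y));
   if h.check() ne 0 then do;  /* key not seen yet */
      rc = h.add();
      output;
   end;
   drop _md5 rc;
run;
```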
@mkeintz wrote: "Unlike the other "cat" functions, I understand that CAT does not strip blanks ... so I do not expect it to create collisions prior to the MD5 function."
True, my bad. But then you potentially have a very long string to hash. It's still better to trim and separate imho.
@mkeintz wrote: "Agreed. Although I hadn't considered _MD5 ($16) long."
The issue we're trying to avoid here is running out of memory (at the cost of CPU, used to derive the hash).
Having said that, SAS hash tables are quite wasteful, and I just checked that a KEY length of 16 with a DATA length of 16 occupies 64 bytes per hash table item, just as a KEY length of 16 with a DATA length of 1 does. So that's one of my "improvements" gone.
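The per-item footprint can be checked directly with the hash object's ITEM_SIZE attribute (a sketch; the exact number depends on the SAS release and platform, so no figure is asserted here):

```sas
data _null_;
   length k d $16;
   declare hash h();
   h.definekey('k');
   h.definedata('d');
   h.definedone();
   bytes = h.item_size;            /* bytes of memory per hash item */
   put 'Bytes per item: ' bytes;
   call missing(k, d);             /* avoid uninitialized notes     */
run;
```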
You need space for the whole uncompressed dataset in your WORK library, so you need to know more about your dataset: how large it is uncompressed, and how well it compresses. The latter can be determined by copying a sufficient subset (say, 1 million obs) to a compressed dataset in WORK and looking at the log.
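For example (BIG is a hypothetical source table), copy a sample with the COMPRESS= option and read the size reduction that the log note reports:

```sas
/* sketch: estimate the compression ratio from a 1-million-row sample */
data work.sample(compress=binary);
   set big(obs=1000000);
run;
/* the log then reports how much compression decreased the size */
```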
On top of the other valid suggestions (free space, use select distinct if you don't care about order), two more suggestions:
- Maybe you don't need this step at all; what comes next?
- Copy the table in SPDE format and it will be sorted on the fly.
This is very efficient and might require less space than PROC SORT, though I have never looked into the space requirements.
Something like this, where SPEEDY is a libname assigned to the SPDE engine:
data TEST SPEEDY.TEST(compress=binary);
retain A1-A99 0;
do I=1e5 to 1 by -1; output; output; end;
run;
proc sort data=TEST out=TEST1 nodup; by _ALL_; run; * current process;
data TEST2; * SPDE process;
set SPEEDY.TEST;
by _ALL_;
if md5(catx('|',of _ALL_)) ne lag( md5(catx('|',of _ALL_)) );
run;
CPU usage will be much higher though.
TAGSORT is meant to reduce the size of the utility file by putting only the key variable(s) and the observation pointer into it. If all variables need to go into it anyway, TAGSORT has no effect.
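In other words (BIG and ID are hypothetical names), TAGSORT pays off only when the BY variables are a small fraction of the record:

```sas
proc sort data=big out=big_sorted tagsort;
   by id;   /* utility file holds only ID plus an observation pointer */
run;
```

With BY _ALL_, every variable is a key, so the "tag" is the whole record and nothing is saved.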
You have gotten a few suggestions on this topic.
But ... it would improve the quality and efficiency of responses if you told us whether your goal is only the removal of duplicates.
Do you really need a dataset ordered by _ALL_?