<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Accessing a huge SAS data file in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737137#M229768</link>
    <description>&lt;P&gt;You are doing a PROC SORT with a BY _ALL_.&amp;nbsp; &amp;nbsp;Given that you are also specifying NODUP, I don't think you care about data order as much as you merely want to remove duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is your real goal just to eliminate duplicate records?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If so, there are ways (a hash object applied to MD5 or SHA256 hashes computed from a concatenation of all your variables) that can be used to eliminate duplicates without the burden of a sort.&lt;/P&gt;</description>
    <pubDate>Tue, 27 Apr 2021 00:36:37 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2021-04-27T00:36:37Z</dc:date>
    <item>
      <title>Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737091#M229752</link>
      <description>&lt;P&gt;I am working with a huge SAS data file (~ 50M observations).&amp;nbsp; When I run it, it says I don't have space. Please see below the log message I got. Could anyone help me to resolve this issue? Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="b0guna01_1-1619463925574.png" style="width: 680px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/58701i9EDF625C21C19FFF/image-dimensions/680x306?v=v2" width="680" height="306" role="button" title="b0guna01_1-1619463925574.png" alt="b0guna01_1-1619463925574.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 26 Apr 2021 19:35:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737091#M229752</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-26T19:35:13Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737096#M229755</link>
      <description>You need 3x the space of a data set to sort it. &lt;BR /&gt;So if you have a 10GB data set do you have 30GB free to sort it? If not you'll need to find a different option - split the file into smaller portions or consider an INDEX instead. &lt;BR /&gt;Sorting by _all_ is also incredibly time intensive and kind of a weird thing to do on such a large data set. &lt;BR /&gt;I would have expected a more specified sort...</description>
      <pubDate>Mon, 26 Apr 2021 19:41:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737096#M229755</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2021-04-26T19:41:23Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737101#M229756</link>
      <description>&lt;P&gt;I am not sure if there may not be a space limit because of operations behind the scenes but&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc sql;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; create table want as&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; select distinct *&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; from have&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; ;&lt;/P&gt;
&lt;P&gt;quit;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;has a small chance of working.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Post log and code as text by copying text from the log or editor, opening a text box using the forum &amp;lt;/&amp;gt; icon and then pasting.&lt;/P&gt;
&lt;P&gt;It is extremely difficult to code from pictures and I for one am too lazy to retype code from a picture.&lt;/P&gt;
&lt;P&gt;Sometimes code is close to working, but if I have to retype a lot of it to make one small change, I'm likely not to. If text is provided, it is easy to edit or to highlight exactly what needs to change, which isn't really easy with pictures.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Apr 2021 20:23:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737101#M229756</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2021-04-26T20:23:13Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737132#M229764</link>
      <description>&lt;P&gt;Looks like you are running SAS locally on your PC so you can easily free up space on your C drive, if there are are a lot of files you don't want to keep including old SAS WORK folders. If D is also a local drive then you could consider using that for SAS WORK also. Don't use remote drives for SAS WORK folders as it will totally kill your performance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 26 Apr 2021 23:05:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737132#M229764</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2021-04-26T23:05:32Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737135#M229766</link>
      <description>&lt;P&gt;If you need to free-up space on your disk, there are some really nice open source tools which help you understand what takes up space and what you could delete. I like&amp;nbsp;&lt;A href="https://windirstat.net/" target="_self"&gt;WinDirStat&lt;/A&gt; a lot.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 00:31:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737135#M229766</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2021-04-27T00:31:18Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737137#M229768</link>
      <description>&lt;P&gt;You are doing a PROC SORT with a BY _ALL_.&amp;nbsp; &amp;nbsp;Given that you are also specifying NODUP, I don't think you care about data order as much as you merely want to remove duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is your real goal just to eliminate duplicate records?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If so, there are ways (a hash object applied to MD5 or SHA256 hashes computed from a concatenation of all your variables) that can be used to eliminate duplicates without the burden of a sort.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 00:36:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737137#M229768</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-04-27T00:36:37Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737174#M229782</link>
      <description>&lt;P&gt;You need space for the whole&amp;nbsp;&lt;EM&gt;uncompressed&lt;/EM&gt; dataset in your WORK. So you need to know more about your dataset:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;observation count&lt;/LI&gt;
&lt;LI&gt;observation size&lt;/LI&gt;
&lt;LI&gt;compressed: yes/no&lt;/LI&gt;
&lt;LI&gt;physical file size of the dataset&lt;/LI&gt;
&lt;LI&gt;if compressed, compression rate&lt;/LI&gt;
&lt;/UL&gt;
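&lt;P&gt;One quick way to estimate that compression rate (a sketch; the BIG libref and HAVE dataset name are assumptions):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Copy a 1M-obs subset into a compressed WORK dataset;
   BIG.HAVE stands in for the real library and dataset. */
data work.compcheck (compress=yes);
  set big.have (obs=1000000);
run;
/* The log then shows a NOTE reporting the percent decrease in size. */&lt;/CODE&gt;&lt;/PRE&gt;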
&lt;P&gt;The latter can be determined by copying a sufficient subset (say, 1 million obs) to a compressed dataset in WORK and looking at the log.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 06:07:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737174#M229782</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-04-27T06:07:21Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737196#M229797</link>
      <description>&lt;P&gt;On top of the other valid suggestions (free space, use &lt;FONT face="courier new,courier"&gt;select distinct &lt;/FONT&gt;if you don't care about order), two more suggestions:&lt;/P&gt;
&lt;P&gt;- Maybe you don't need this step at all; what comes next?&lt;/P&gt;
&lt;P&gt;- Copy the table in SPDE format and it will be sorted on the fly.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; This is very efficient and might require less space than PROC SORT; I have never looked into the space requirements.&lt;/P&gt;
&lt;P&gt;Something like&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data TEST SPEEDY.TEST(compress=binary);   
  retain A1-A99 0;
  do I=1e5 to 1 by -1; output; output; end; 
run;

proc sort data=TEST out=TEST1 nodup; by _ALL_; run; * current process;

data TEST2;                                         * SPDE process;
  set SPEEDY.TEST;
  by _ALL_;
  if md5(catx('|',of _ALL_)) ne lag( md5(catx('|',of _ALL_)) );
run;

&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;CPU usage will be much higher though.&lt;/P&gt;
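&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(The SPEEDY libref in the step above assumes an SPDE library has already been assigned; the path is a placeholder:)&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Assign an SPDE library; replace the path with a real local directory */
libname SPEEDY spde "D:\spde_data";&lt;/CODE&gt;&lt;/PRE&gt;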
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 09:04:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737196#M229797</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2021-04-27T09:04:59Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737197#M229798</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31461"&gt;@mkeintz&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13976"&gt;@SASKiwi&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Just as a side note when using functions like catx() and sha256(): their results are often limited to 32KB (at least under SAS 9.4), so be careful with &lt;EM&gt;of _all_&lt;/EM&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 09:25:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737197#M229798</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2021-04-27T09:25:20Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737249#M229824</link>
      <description>Your dataset is too big for PROC SORT , try TAGSORT option.&lt;BR /&gt;&lt;BR /&gt;proc sort data=TEST out=TEST1 nodup  tagsort sortsize=max ;&lt;BR /&gt;by _ALL_; &lt;BR /&gt;run;</description>
      <pubDate>Tue, 27 Apr 2021 12:51:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737249#M229824</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-04-27T12:51:42Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737428#M229889</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;TAGSORT&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;will not work with&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;by _ALL_;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 20:56:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737428#M229889</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2021-04-27T20:56:55Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737431#M229892</link>
      <description>&lt;P&gt;TAGSORT is meant to reduce the size of the utility file by putting only the key variable(s) and the observation pointer into it. If all variables need to go into it anyway, TAGSORT has no effect.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Apr 2021 21:00:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737431#M229892</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-04-27T21:00:41Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737551#M229955</link>
      <description>1) Try PROC SQL&lt;BR /&gt;proc sql;&lt;BR /&gt;create table want as&lt;BR /&gt;select distinct * from have;&lt;BR /&gt;quit;&lt;BR /&gt;&lt;BR /&gt;2)Try batch process:&lt;BR /&gt;&lt;A href="https://communities.sas.com/t5/SAS-Programming/Insufficient-space-in-file-WORK-SASTMP-000000024-n-UTILITY/m-p/737449#M229910" target="_blank"&gt;https://communities.sas.com/t5/SAS-Programming/Insufficient-space-in-file-WORK-SASTMP-000000024-n-UTILITY/m-p/737449#M229910&lt;/A&gt;</description>
      <pubDate>Wed, 28 Apr 2021 12:01:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737551#M229955</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-04-28T12:01:17Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737588#M229973</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379835"&gt;@b0guna01&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You have gotten a few suggestions on this topic.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But ... it would improve the quality and efficiency of responses if you told us whether your goal is only the removal of duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you really need a dataset ordered by _ALL_?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Apr 2021 14:26:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/737588#M229973</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-04-28T14:26:45Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738155#M230201</link>
      <description>&lt;P&gt;My computer has enough space but still takes around 6 hours.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:50:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738155#M230201</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:50:01Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738156#M230202</link>
      <description>&lt;P&gt;Yes, I am accessing the computer through a VPN.&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:53:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738156#M230202</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:53:29Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738157#M230203</link>
      <description>&lt;P&gt;we already cleaned, but we don't see a huge difference in terms of timing.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:55:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738157#M230203</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:55:29Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738158#M230204</link>
      <description>&lt;P&gt;I want to remove the duplicates before doing the data analysis. It has 124 variables.&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 13:56:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738158#M230204</guid>
      <dc:creator>b0guna01</dc:creator>
      <dc:date>2021-04-30T13:56:51Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738290#M230268</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379835"&gt;@b0guna01&lt;/a&gt;&amp;nbsp; - How you access the computer is irrelevant to your problem. If you are sending your data across a network to or from remote storage then it is definitely relevant to your problem&lt;/P&gt;</description>
      <pubDate>Fri, 30 Apr 2021 23:55:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738290#M230268</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2021-04-30T23:55:21Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing a huge SAS data file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738291#M230269</link>
      <description>&lt;P&gt;OK, it's just de-duping.&amp;nbsp; Then you can replicate the NODUP + BY _ALL_ results by using the MD5 function and a hash object:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql noprint;
  /* Make a csv list of PUT(X,rb8) for all x's that are numeric variables */
  select cats('put(',name,',rb8.)') into :num_to_rb8 separated by ','
  from dictionary.columns 
  where libname='PROJECT' and memname='MEDICAID_V01_2010' and type='num';

  /* Get the total length of a single observation */
  select obslen into :concat_len
  from dictionary.tables
  where libname='PROJECT' and memname='MEDICAID_V01_2010';
quit;
%put &amp;amp;=num_to_rb8 ;
%put &amp;amp;=obslen ;

data want (drop=_:) ;
  set PROJECT.MEDICAID_V01_2010;

  /* Concatenate all the data into a single string ("message") named _CONCAT */
  length _concat $&amp;amp;concat_len ;
  _concat=cat(&amp;amp;num_to_rb8,of _character_);

  /* Make a "unique" signature for the message */
  length _md5 $16;
  _md5=md5(_concat);

  if _n_=1 then do;
    declare hash md5 (hashexp:10);
      md5.definekey('_md5');
      md5.definedata('_md5');
      md5.definedone();
  end;

  if md5.find()^=0 then do;
    output;  /*Output first obs for a given signature*/
    md5.add();
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I do not directly list numeric variables as arguments of the CAT (or CATX) functions, &lt;EM&gt;&lt;STRONG&gt;because different numeric values can generate matching _concat values&lt;/STRONG&gt;&lt;/EM&gt; (in turn generating matching md5 values), destroying the whole point of de-duping here.&amp;nbsp; Consider the two concatenations below, where X ^= Y but cat(x,x)=cat(x,y):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  x=0.1234567890123456;
  y=0.1234567890123457;
  if x=y then put "X Equals Y" ;
  else put "X Does NOT Equal Y";

  cat_x_x = cat(x,x);
  cat_x_y = cat(x,y);
  if cat_x_x=cat_x_y then put "CAT(X,X) DOES Equal CAT(X,Y)";
  else put "CAT(X,X) does NOT = CAT(X,Y)";
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;which generates the log&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;X Does NOT Equal Y
CAT(X,X) DOES Equal CAT(X,Y)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In other contexts (for instance when using the MD5 function), this is known as a "collision" - where distinct values of the original data generate equivalent results.&amp;nbsp; That's because the CAT family of functions convert the numeric values into text prior to concatenation, which does not always represent the value to the needed precision.&amp;nbsp; You can avoid that by keeping the original numeric "real binary" representation by using:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;  cat_x_x = cat(put(x,rb8.),put(x,rb8.));
  cat_x_y = cat(put(x,rb8.),put(y,rb8.));&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If you rerun the modified program the log will say:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;X Does NOT Equal Y
CAT(X,X) does NOT = CAT(X,Y)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;That's why you see my PROC SQL code generating the macrovar &amp;amp;num_to_rb8.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The usual concern about a collision risk is in using the MD5 function, but that risk is very very .... very low.&amp;nbsp; It is intended to generate distinct values.&amp;nbsp; Citing page 339 of&amp;nbsp;&lt;A href="https://support.sas.com/content/dam/SAS/support/en/books/data-management-solutions-using-sas-hash-table-operations/69153_excerpt.pdf" target="_self"&gt;Data Management Solutions Using SAS Hash Table Operations&lt;/A&gt;&amp;nbsp; by Paul Dorfman and Don Henderson:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;In the worst case scenario, the approximate number of items that need to be hashed to get a 50 percent chance of an MD5 collision is about 2**64≃2E+19. It means that to encounter just 1 collision, the MD5 function has to be executed against 100 quintillion distinct arguments the equal number of times, i.e., approximately 1 trillion times per second for 100 years. The probability of such an event is so infinitesimally negligible that one truly has an enormously greater chance of living through a baseball season where every single pitch is a strike and no batter ever gets on base. (Amusingly, some people who will confidently say that can never, ever happen may believe that an MD5 collision can happen.)&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Now if the number of true duplicates in the original data set is low, one could identify the records having duplicate MD5 values, and then confirm they all arise from true duplicate observations.&amp;nbsp; I'm not including such code here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 01 May 2021 03:19:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Accessing-a-huge-SAS-data-file/m-p/738291#M230269</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-05-01T03:19:57Z</dc:date>
    </item>
  </channel>
</rss>

