Re: Accessing a huge SAS data file

b0guna01 · Posted 04-26-2021 03:34 PM

I am working with a huge SAS data file (~ 50M observations). When I run it, it says I don't have space. Please see below the log message I got. Could anyone help me to resolve this issue? Thanks.

Reeza · Posted 04-26-2021 03:41 PM

You need 3x the space of a data set to sort it.
So if you have a 10GB data set do you have 30GB free to sort it? If not you'll need to find a different option - split the file into smaller portions or consider an INDEX instead.
Sorting by _all_ is also incredibly time intensive and kind of a weird thing to do on such a large data set.
I would have expected a more specified sort...

b0guna01 · Posted 04-30-2021 09:50 AM

My computer has enough space but still takes around 6 hours.

ballardw · Posted 04-26-2021 04:23 PM

I am not sure if there may not be a space limit because of operations behind the scenes but

proc sql;

create table want as

select distinct *

from have

;

quit;

has a small chance of working.

Post log and code as text by copying text from the log or editor, opening a text box using the forum </> icon and then pasting.

It is extremely difficult to code from pictures and I for one am too lazy to retype code from a picture.

Sometimes code is close to working but if I have to retype a lot of stuff to make one small change I'm likely not to. If text is provided then it is easy to edit or simply highlight things that need to change which isn't really easy with pictures.

SASKiwi · Posted 04-26-2021 07:05 PM

Looks like you are running SAS locally on your PC so you can easily free up space on your C drive, if there are are a lot of files you don't want to keep including old SAS WORK folders. If D is also a local drive then you could consider using that for SAS WORK also. Don't use remote drives for SAS WORK folders as it will totally kill your performance.

b0guna01 · Posted 04-30-2021 09:53 AM

Yes, I am accessing the computer through a VPN.

SASKiwi · Posted 04-30-2021 07:55 PM

@b0guna01 - How you access the computer is irrelevant to your problem. If you are sending your data across a network to or from remote storage then it is definitely relevant to your problem

Patrick · Posted 04-26-2021 08:30 PM

If you need to free-up space on your disk, there are some really nice open source tools which help you understand what takes up space and what you could delete. I like WinDirStat a lot.

b0guna01 · Posted 04-30-2021 09:55 AM

we already cleaned, but we don't see a huge difference in terms of timing.

LinusH · Posted 05-02-2021 10:56 AM

So your problem now is that it takes to long to execute?

Primarly it's a matter of resources of your work station.

If you set

options fullstimer;

you will be given more information in the log of what resources is taking time.

It could be worth to check your computers Task Manager during execution to see CPU, memory, and I/O consumption from the OS view.

One thing that could affect performance is if you could use more of the RAM, so SAS doesn't have to swap as much. Compare your MEMSIZE and SORTSIZE settings with the available RAM you have when your computer is "resting".

Data never sleeps

mkeintz · Posted 04-26-2021 08:36 PM

You are doing a PROC SORT with a BY _ALL_. Given that you are also specifying NODUP, I don't think you care about data order as much as you merely want to remove duplicates.

Is your real goal just to eliminate duplicate records?

If so, there are ways (hash objects applied to MD5 or SHA256 encryption applied against a concatenation of all your variables) that can be used to eliminate duplicates without the burden of a sort.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

b0guna01 · Posted 04-30-2021 09:56 AM

I want to remove the duplicates before doing the data analysis. It has 124 variables.

mkeintz · Posted 04-30-2021 08:51 PM

OK, it's just de-duping. Then you can replicate the NODUP + BY _ALL_ results by using the MD5 function and a hash object:

proc sql noprint;
  /* Make a csv list of PUT(X,rb8) for all x's that are numeric variables */
  select cats('put(',name,',rb8.)') into :num_to_rb8 separated by ','
  from dictionary.columns 
  where libname='PROJECT' and memname='MEDICAID_V01_2010' and type='num';

  /* Get the total length of a single observation */
  select obslen into :concat_len
  from dictionary.tables
  where libname='PROJECT' and memname='MEDICAID_V01_2010';
quit;
%put &=num_to_rb8 ;
%put &=obslen ;

data want (drop=_:) ;
  set PROJECT.MEDICAID_V01_2010;

  /* Concatenate all the data into a single string ("message") named _CONCAT */
  length _concat $&concat_len ;
  _concat=cat(&num_to_rb8,of _character_);

  /* Make a "unique" signature for the message */
  length _md5 $16;
  _md5=md5(_concat);

  if _n_=1 then do;
    declare hash md5 (hashexp:10);
      md5.definekey('_md5');
      md5.definedata('_md5');
      md5.definedone();
  end;

  if md5.find()^=0 then do;
    output;  /*Output first obs for a given signature*/
    md5.add();
  end;
run;

I do not directly list numeric variables as arguments of the CAT (or CATX) functions, because different numeric values can generate matching _concat values, (in turn generating matching md5 values), destroying the whole point of de-duping here. Consider the two concatenations below, where X ^= Y, but cat(x,x)=cat(x,y):

data _null_;
  x=0.1234567890123456;
  y=0.1234567890123457;
  if x=y then put "X Equals Y" ;
  else put "X Does NOT Equal Y";

  cat_x_x = cat(x,x);
  cat_x_y = cat(x,y);
  if cat_x_x=cat_x_y then put "CAT(X,X) DOES Equal CAT(X,Y)";
  else put "CAT(X,X) does NOT = CAT(X,Y)";
run;

which generates the log

X Does NOT Equal Y
CAT(X,X) DOES Equal CAT(X,Y)

In other contexts (for instance when using the MD5 function), this is known as a "collision" - where distinct values of the original data generate equivalent results. That's because the CAT family of functions convert the numeric values into text prior to concatenation, which does not always represent the value to the needed precision. You can avoid that by keeping the original numeric "real binary" representation by using:

  cat_x_x = cat(put(x,rb8.),put(x,rb8.));
  cat_x_y = cat(put(x,rb8.),put(y,rb8.));

If you rerun the modified program the log will say:

X Does NOT Equal Y
CAT(X,X) does NOT = CAT(X,Y)

That's why you see my PROC SQL code generating the macrovar &num_to_rb8.

The usual concern about a collision risk is in using the MD5 function, but that risk is very very .... very low. It is intended to generate distinct values. Citing page 339 of Data Management Solutions Using SAS Hash Table Operations by Paul Dorfman and Don Henderson:

In the worst case scenario, the approximate number of items that need to be hashed to get a 50 percent chance of an MD5 collision is about 2**64≃2E+19. It means that to encounter just 1 collision, the MD5 function has to be executed against 100 quintillion distinct arguments the equal number of times, i.e., approximately 1 trillion times per second for 100 years. The probability of such an event is so infinitesimally negligible that one truly has an enormously greater chance of living through a baseball season where every single pitch is a strike and no batter ever gets on base. (Amusingly, some people who will confidently say that can never, ever happen may believe that an MD5 collision can happen.)

Now if the number of true duplicates in the original data set is low, one could identify the records having duplicate MD5 values, and then confirm they all arise from true duplicate observations. I'm not including such code here.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PeterClemmensen · Posted 05-01-2021 02:12 PM

@mkeintz, I would push the like button twice if I could.

The DATA to DATA Step Macro
Blog: SASnrd

ChrisNZ · Posted 05-12-2021 02:10 AM

Wonderful post as always @mkeintz .

A few notes:

1. The code fails if numeric and character variables are not present

2. Using function cat() for the string passed to md5() can fail to identify duplicates correctly if missing character values exist.

3. Using a long value for the data variable in the hash table is unneeded.

4. Using check() rather than find() to query the hash table is faster

Here is a quick macro that addresses these issues:

%macro dedupe(libin=, dsin=, libout=, dsout=);  %macro _; %mend _; 

%local string ;

%if %length(&libin )=0  %then %let libin =WORK;
%if %length(&libout)=0  %then %let libout=WORK;
%if %length(&dsout )=0  %then %let dsout =&dsin;
       
%if ^%sysfunc(exist(&libin..&dsin)) %then %do;
  %put ERROR: Data set &libin..&dsin not found.;
  %return;
%end;

%* Get list of variables;
proc contents data=&libin..&dsin noprint out=__CONTENTS__; run;

%* Generate complete value output for each variable; ;
proc sql noprint; 
  select ifc(TYPE=1, cats('"/"||put(' ,NAME,',rb8.)')
                   , cats('"/"||trim(',NAME,     ')') ) into :string separated by '||'
  from __CONTENTS__ ;
quit;
%put &=string ;


data  &libout..&dsout. (drop=__:) ;
  set &libin..&dsin.;

  %* Create a unique hash value for each observation ;
  length __MD5 $16 __DUMMY $1;
  __MD5=md5(&string.);

  %* Create a hash table ;
  if _N_=1 then do;
    declare hash MD5 (hashexp:10);
      MD5.definekey('__MD5');
      MD5.definedata('__DUMMY');
      MD5.definedone();
      __DUMMY=' ';
  end;

  %* If new record found, then save it, and update the hash table ;
  if MD5.check() then do;
    output;  
    MD5.add();
  end;

run;

%mend;
                         
%dedupe(libin=SASHELP, dsin=CLASS, libout=WORK, dsout=TEST);

High-Performance SAS Coding - Third Edition

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away