Fastest way to calculate the max of a variable in a large dataset

Reply
Contributor
Posts: 41

Fastest way to calculate the max of a variable in a large dataset

Hi,

 

I have a dataset with tens of millions of lines and I wand to calculate the max of a column.

calculating this max took more than 10 minutes with proc sql and proc means.

 

Is there a fastest way to do this (Arrays, hash objects, iml)

 

thanks in advance 

 

MK

Super User
Posts: 6,963

Re: Fastest way to calculate the max of a variable in a large dataset

If you need to find different max values for groups, there are several methods that may differ in performance.

But if you just need the overall max of a column, all thats SAS does is one sequential scan through the data set, which is only limited by the I/O throughput of your computer.

No matter if you do

proc sql;
select max(columna) from have;
quit;

or

data want;
set have (end=done);
retain maxa 0;
maxa = max(columna,maxa);
if done then output;
run;

or max with proc means.

 

Just do a quick calculation of (physical size of the dataset) / (runtime in seconds).

 

If your dataset contains lots of mostly empty strings or other repeating patterns, using the dataset option compress=yes might save you lots of disk space and therefore lots of time spent doing I/O.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 41

Re: Fastest way to calculate the max of a variable in a large dataset

Hi @KurtBremser,

 

Thanks for your answer.

 

I'am aware that the I/O is the main reason of the important execution time. That's why i'am looking for ways that load partially or totally data into memory.

I used this program but still have the same execution time (cpu time 1:38 // real time 11:42).

data want (keep=max);
     set have (keep=variable) end=eof;
     array v{1}  variable;
     max=v{1};
     do i=2 to dim(b);
          if max<max(v{i}) then max=max(v{i});
     end;
     if eof;
run;

For your suggestion of using compress option, I'm using the data step keep= option, which I think more efficient.

 

best regards,

 

MK

Super User
Posts: 6,963

Re: Fastest way to calculate the max of a variable in a large dataset

[ Edited ]

As long as you want to get the max of a column over the WHOLE dataset, there's no way around reading the WHOLE dataset. Period.

It's part of the way a SAS dataset is stored. You find all columns in all dataset pages, so thw whole shebang needs to be read.

The most important things are:

- reduce the physical storage size (if compress= has a noticable effect)

- make sure the dataset is only read once (no implicit sorting etc, happens often with proc sql)

 

If you find that you only need a subset of columns in most of your analyses, it makes sense to build a (horizontally) reduced dataset.

The same is true if you only need a special group of observations most of the time.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 41

Re: Fastest way to calculate the max of a variable in a large dataset

Hi @KurtBremser,

 

Thanks for your answer.

 

I think you're right about calculating the max of a variable over the WHOLE dataset. And here the difference is in the number of columns and their size (many character variables with length >20 caracters for example).

 

I created a table with only the variable on which I want to calculate the max.

 

data variable_only;
set have (keep=variable);
run;

creating the table with the dataset option keep=variable took approximately the same time of the max calculation. Although I have a SSD disk, the operation was very slow (let's say ~5mb per second)

 

After this, I computed the max using sql :

proc sql;
    select max(variable)
    from variable_only;
quit;

and it took only few seconds.

 

So It's working with data containing all variables that makes it slow, even when working with keep= or equivalent...

Super User
Posts: 6,963

Re: Fastest way to calculate the max of a variable in a large dataset


majdi_ka wrote:

......

 

I created a table with only the variable on which I want to calculate the max.

 

data variable_only;
set have (keep=variable);
run;

creating the table with the dataset option keep=variable took approximately the same time of the max calculation. Although I have a SSD disk, the operation was very slow (let's say ~5mb per second)

 

After this, I computed the max using sql :

proc sql;
    select max(variable)
    from variable_only;
quit;

 

In the first step, you once again need to read the whole dataset. Writing the new dataset is a drop in the ocean compared to the cost of reading the big data. Although the keep= option looks as if you only read a small fraction of the data, the system still has to access the whole dataset.

 

In SAS, data is stored side-by-side in records, and the records consecutively, just as if you write it in a text file. RDBMS's will load and access data in a way that, when you access a single column, only that data is moved in memory; SAS can't do that with SAS tables.

 

Now, once you have the small dataset, you see how the max() operation improves by orders of magnitude.

 

Since you mentioned the presence of several character fileds with >20 characters, storing the big dataset with the compress= option is a must, IMO. Do that (if you haven't already), and then rerun all your operations to see how it changes performance.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 41

Re: Fastest way to calculate the max of a variable in a large dataset

This is very clear thanks.

 

In fact I didn't compress it yet.

I use this table to do joins. That's why I sorted it and created an index on it.

I don't know if the index still work if I compress it. In addition, I'm not sure if the dataset should be uncompressed each time I work on it, so it takes more time...

 

I will compress it and tell you the result !

 

Thanks 

 

MK

 

Super User
Posts: 6,963

Re: Fastest way to calculate the max of a variable in a large dataset

You need never uncompress a SAS dataset that was created with compress=yes. SAS will always decompress it on the fly. The CPU cycles needed for umcompress are always offset by the

Caveat: when a compressed dataset is sorted, the utility file is not compressed and will need extra space; this can be mitigated by using the tagsort option in proc sort.

 

Once you recreate a SAS dataset with compress=yes, you also need to recreate the index, as the physical location of observations within the dataset file has changed. IIRC, as I do not use index files (we rarely need only small subsets of files, we usually work with whole datasets or large parts thereof, where indexes do not improve performance, but rather make it worse).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 9,687

Re: Fastest way to calculate the max of a variable in a large dataset

Data step is good enough .

 

data _null_;
 set sashelp.class end=last;
 retain max;
 max=max(max,weight);
 if last then put max=;
run;
Super Contributor
Posts: 336

Re: Fastest way to calculate the max of a variable in a large dataset

10 minutes computing time is somewhat strange.

I don't think hashing would be faster to get the max value than proc means or so.  Is there something special about your data set?

 

Options Fullstimer;

Data A;
  Do i=1 To 1e7; * 10 million rows;
    x=Rannor(1);
	Output;
  End;
Run;

Proc SQL;
  Select Max(x) From A;
Quit;

Proc Means Data=A Max;
  Var x;
Run;

Data _NULL_;
  Length x 8.;
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'A (Keep=x)',Hashexp:20);
	Declare HIter HI ('H');
	H.Definekey('x');
	H.Definedone();
    Call Missing (x);
  End;
  Retain Max;
  rc=HI.First();
  Do While (not rc);
    Max=IfN(x gt Max,x,Max);
    rc=HI.Next();
  End;
  If rc Then Put Max;
Run;
Super User
Posts: 6,963

Re: Fastest way to calculate the max of a variable in a large dataset

If he/she has "tens of millions" of lines:

ex. 50,000,000 lines of 500 bytes each will lead to ~25 GB of data, or a throughput of 2.5 GB per minute

2500/60 gives a sustained throughput of ~40 MB/sec, which is not that bad from a typical PC hard disk.

 

With SAS it makes sense to invest in fast storage, SSD's preferred nowadays.

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 41

Re: Fastest way to calculate the max of a variable in a large dataset

hi @user24feb,

 

thanks for your program and your effort.

 

here the fastest way is proc sql. hash table is slower due to loading data into memory.

 

Doing the calculation with a dataset with only one variable is really fast. I think as I mentioned in an other comment, the problem appears when working with a large dataset with many variables (especially character variable).

 

MK

SAS Super FREQ
Posts: 3,483

Re: Fastest way to calculate the max of a variable in a large dataset

What version of SAS are you using? How much memory have you allocated to your SAS process? Please post the SAS log that results from running the following:

 

%put SYSVLONG = &SYSVLONG;
proc options option=memsize value; run;

 

Contributor
Posts: 41

Re: Fastest way to calculate the max of a variable in a large dataset

Hi @Rick_SAS,

 

here is the output of the log (it's a french version):

 

 

SYSVLONG = 9.04.01M3P062415
26 proc options option=memsize value;
27 run;

SAS (r) Proprietary Software Version 9.4 TS1M3

Informations sur la valeur de l'option SAS MEMSIZE
Valeur :14014924800
Etendue : SAS Session
Comment la valeur d'option est définie : Config File
Nom du fichier de config. :
C:\Program Files\SASHome\SASFoundation\9.4\nls\fr\sasv9.cfg

SAS Super FREQ
Posts: 3,483

Re: Fastest way to calculate the max of a variable in a large dataset

OK, so you have the latest version of SAS (SAS 9.4m3) and 13GB of RAM. If you had a data set that has 10 million rows and ONE variable, computing the max would take a fraction of a second. As others have said, the issue is probably that you have a large number of variables and SAS has to read through the nonessential data to reach the numerical variables that it needs.

For our understanding, could you share:
1) How many total variables in the data set?
2) How many are character?
3) How many variables are you computing the max for?
Ask a Question
Discussion stats
  • 21 replies
  • 807 views
  • 10 likes
  • 6 in conversation