majdi_ka
Obsidian | Level 7

Hi,

 

I have a dataset with tens of millions of rows and I want to calculate the max of a column.

Calculating this max took more than 10 minutes with PROC SQL and PROC MEANS.

 

Is there a faster way to do this (arrays, hash objects, IML)?

 

thanks in advance 

 

MK

1 ACCEPTED SOLUTION
Rick_SAS
SAS Super FREQ

Editor's note: this is a popular topic, so we're summarizing here: use PROC MEANS, request the MAX statistic, and name the one variable in the VAR statement. Even if the table is in a non-SAS database, the summary work should get pushed to the database for fast processing.

 

Proc Means Data=MixTypeData Max;
  Var x1;
Run;

--- Details ---

So that everyone has the same data set to experiment on, here is a data set that has 10 million records and 86 variables, of which 24 are character variables of length 30. On my CPU, it takes 30 seconds to create the data and 4 SECONDS to compute the max.

 

@majdi_ka Please run this program. If it is taking "more than 10 MINUTES" to run, then I favor @Kurt_Bremser's theory that it is hardware related. I am not a hardware person, but others on the list might help you to investigate hardware/network issues like:

  1. Is your drive fragmented/congested/slow/etc.?
  2. Is the data set on a mounted or networked drive? (Network access is slow.)
  3. Is the data stored in a non-native format so that SAS is reading the data by using CEDA? (Also slow; a quick check is sketched below.)
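
For point 3, a quick hedged check (using the dataset name have from the examples elsewhere in this thread): PROC CONTENTS reports the data representation and encoding of the file, which tells you whether SAS has to fall back on CEDA to read it.

proc contents data=have;  /* check "Data Representation" and "Encoding" in the output */
run;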

 

%let nCont = 62;             /* number of contin vars               */
%let nChar = 24;             /* number of character vars         */
%let N = 1e7;
data MixTypeData;   
drop i j;
array x{&nCont} x1-x&nCont;  
array c{&nChar} $30. c1-c&nChar;  
call streaminit(1);
do i = 1 to &N; 
   do j = 1 to &nCont;              
      x{j} = rand("Uniform");  
   end; 
   do j = 1 to &nChar;
      c{j} = "ABCDEFGHIJKLMNOP";
   end; 
   output;  
end;  
run;

Proc Means Data=MixTypeData Max;
  Var x1;
Run;

 


21 REPLIES
Kurt_Bremser
Super User

If you need to find different max values for groups, there are several methods that may differ in performance.

But if you just need the overall max of a column, all that SAS does is one sequential scan through the data set, which is limited only by the I/O throughput of your computer.

No matter if you do

proc sql;
select max(columna) from have;
quit;

or

data want;
set have (end=done);
retain maxa;                 /* leave the initial value missing; max() ignores missing */
maxa = max(columna,maxa);
if done then output;
run;

or MAX with PROC MEANS (a sketch follows below).
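
For completeness, a sketch of the PROC MEANS version, reusing the hypothetical names have and columna from the examples above:

proc means data=have max;
   var columna;
run;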

 

Just do a quick calculation of (physical size of the dataset) / (runtime in seconds).

 

If your dataset contains lots of mostly empty strings or other repeating patterns, using the dataset option compress=yes might save you lots of disk space and therefore lots of time spent doing I/O.
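
A minimal sketch of that suggestion (the output name have_c is made up for illustration); the option can be set for a single dataset or globally:

options compress=yes;          /* system option: compress every dataset created from here on */

data have_c (compress=yes);    /* or compress just this one copy of the data */
   set have;
run;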

majdi_ka
Obsidian | Level 7

Hi @Kurt_Bremser,

 

Thanks for your answer.

 

I'm aware that I/O is the main reason for the long execution time. That's why I'm looking for ways to load the data partially or fully into memory.

I used this program but still get about the same execution time (CPU time 1:38 // real time 11:42).

data want (keep=max);
     set have (keep=variable) end=eof;
     array v{1} variable;
     retain max;                      /* carry the running max across observations */
     do i=1 to dim(v);
          max=max(max,v{i});
     end;
     if eof;
run;

Regarding your suggestion to use the compress option: I'm using the data step keep= option, which I think is more efficient.

 

best regards,

 

MK

Kurt_Bremser
Super User

As long as you want to get the max of a column over the WHOLE dataset, there's no way around reading the WHOLE dataset. Period.

It's part of the way a SAS dataset is stored: you find all columns in all dataset pages, so the whole shebang needs to be read.

The most important things are:

- reduce the physical storage size (if compress= has a noticeable effect)

- make sure the dataset is read only once (no implicit sorting etc., which happens often with proc sql)

 

If you find that you only need a subset of columns in most of your analyses, it makes sense to build a (horizontally) reduced dataset.

The same is true if you only need a specific group of observations most of the time.
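
A hedged sketch of such a reduced dataset; the second variable, the group filter, and the name have_small are assumptions for illustration only:

data have_small;
   set have (keep=variable group          /* keep only the columns you actually analyze */
             where=(group='A'));          /* and, if useful, only the observations you need */
run;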

majdi_ka
Obsidian | Level 7

Hi @Kurt_Bremser,

 

Thanks for your answer.

 

I think you're right about calculating the max of a variable over the WHOLE dataset. And here the difference is in the number of columns and their size (many character variables with length >20 characters, for example).

 

I created a table with only the variable on which I want to calculate the max.

 

data variable_only;
set have (keep=variable);
run;

Creating the table with the dataset option keep=variable took approximately the same time as the max calculation. Although I have an SSD, the operation was very slow (let's say ~5 MB per second).

 

After this, I computed the max using sql :

proc sql;
    select max(variable)
    from variable_only;
quit;

and it took only a few seconds.

 

So it's working with the dataset containing all the variables that makes it slow, even when using keep= or equivalent...

Kurt_Bremser
Super User

majdi_ka wrote:

......

 

I created a table with only the variable on which I want to calculate the max.

 

data variable_only;
set have (keep=variable);
run;

Creating the table with the dataset option keep=variable took approximately the same time as the max calculation. Although I have an SSD, the operation was very slow (let's say ~5 MB per second).

 

After this, I computed the max using sql :

proc sql;
    select max(variable)
    from variable_only;
quit;

 

In the first step, you once again need to read the whole dataset. Writing the new dataset is a drop in the ocean compared to the cost of reading the big one. Although the keep= option looks as if you only read a small fraction of the data, the system still has to access the whole dataset.

 

In SAS, data is stored side by side in records, and the records consecutively, just as if you wrote it to a text file. RDBMSs will load and access data in a way that, when you access a single column, only that data is moved into memory; SAS can't do that with SAS tables.

 

Now, once you have the small dataset, you see how the max() operation improves by orders of magnitude.

 

Since you mentioned the presence of several character fields with >20 characters, storing the big dataset with the compress= option is a must, IMO. Do that (if you haven't already), and then rerun all your operations to see how it changes performance.

majdi_ka
Obsidian | Level 7

This is very clear thanks.

 

In fact I didn't compress it yet.

I use this table to do joins. That's why I sorted it and created an index on it.

I don't know if the index still works if I compress it. In addition, I'm not sure whether the dataset has to be uncompressed each time I work on it, which would take more time...

 

I will compress it and tell you the result !

 

Thanks 

 

MK

 

Kurt_Bremser
Super User

You never need to uncompress a SAS dataset that was created with compress=yes; SAS will always decompress it on the fly. The CPU cycles needed to uncompress are always offset by the reduced I/O.

Caveat: when a compressed dataset is sorted, the utility file is not compressed and will need extra space; this can be mitigated by using the tagsort option in proc sort.
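
A minimal sketch of that mitigation, with the compressed dataset name have_c and the BY variable assumed from the earlier examples:

proc sort data=have_c tagsort;   /* tagsort keeps only the BY values and observation numbers in the utility file */
   by variable;
run;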

 

Once you recreate a SAS dataset with compress=yes, you also need to recreate the index, as the physical location of observations within the dataset file has changed, IIRC; I do not use index files myself (we rarely need only small subsets of files; we usually work with whole datasets or large parts thereof, where indexes do not improve performance but rather make it worse).
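
A hedged sketch of recreating the compressed dataset together with its index in one step (dataset and variable names assumed):

data have_c (compress=yes index=(variable));   /* index= rebuilds a simple index on the new file */
   set have;
run;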

Ksharp
Super User

Data step is good enough.

 

data _null_;
 set sashelp.class end=last;
 retain max;
 max=max(max,weight);
 if last then put max=;
run;
user24feb
Barite | Level 11

10 minutes of computing time is somewhat strange.

I don't think hashing would be faster than PROC MEANS or the like for getting the max value. Is there something special about your data set?

 

Options Fullstimer;

Data A;
  Do i=1 To 1e7; * 10 million rows;
    x=Rannor(1);
    Output;
  End;
Run;

Proc SQL;
  Select Max(x) From A;
Quit;

Proc Means Data=A Max;
  Var x;
Run;

Data _NULL_;
  Length x 8;
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'A (Keep=x)',Hashexp:20);
    Declare HIter HI ('H');
    H.Definekey('x');
    H.Definedone();
    Call Missing (x);
  End;
  Retain Max;
  rc=HI.First();
  Do While (not rc);
    Max=IfN(x gt Max,x,Max);
    rc=HI.Next();
  End;
  If rc Then Put Max;
Run;
Kurt_Bremser
Super User

If he/she has "tens of millions" of lines:

e.g. 50,000,000 lines of 500 bytes each lead to ~25 GB of data, or a throughput of 2.5 GB per minute in a 10-minute run.

2500/60 gives a sustained throughput of ~40 MB/sec, which is not that bad for a typical PC hard disk.

 

With SAS it makes sense to invest in fast storage; SSDs are preferred nowadays.

 

majdi_ka
Obsidian | Level 7

hi @user24feb,

 

thanks for your program and your effort.

 

Here the fastest way is PROC SQL. The hash table is slower due to the time spent loading the data into memory.

 

Doing the calculation on a dataset with only one variable is really fast. As I mentioned in another comment, I think the problem appears when working with a large dataset with many variables (especially character variables).

 

MK

Rick_SAS
SAS Super FREQ

What version of SAS are you using? How much memory have you allocated to your SAS process? Please post the SAS log that results from running the following:

 

%put SYSVLONG = &SYSVLONG;
proc options option=memsize value; run;

 

majdi_ka
Obsidian | Level 7

Hi @Rick_SAS,

 

here is the output of the log (it's a French installation):

 

 

SYSVLONG = 9.04.01M3P062415
26 proc options option=memsize value;
27 run;

SAS (r) Proprietary Software Version 9.4 TS1M3

Information about the value of the SAS option MEMSIZE
Value: 14014924800
Scope: SAS Session
How the option value is set: Config File
Name of the config file:
C:\Program Files\SASHome\SASFoundation\9.4\nls\fr\sasv9.cfg

Rick_SAS
SAS Super FREQ
OK, so you have the latest version of SAS (SAS 9.4m3) and 13GB of RAM. If you had a data set that has 10 million rows and ONE variable, computing the max would take a fraction of a second. As others have said, the issue is probably that you have a large number of variables and SAS has to read through the nonessential data to reach the numerical variables that it needs.

For our understanding, could you share:
1) How many total variables in the data set?
2) How many are character?
3) How many variables are you computing the max for?
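
One hedged way to gather those numbers (dataset name assumed): write the PROC CONTENTS metadata to a dataset and tabulate the variable types.

proc contents data=have out=vars(keep=name type length) noprint;
run;

proc freq data=vars;
   tables type;   /* type=1 numeric, type=2 character */
run;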

