1) 86 variables
2) 24 character variables
3) only one variable.
Yes, I totally understand this, and I think that's why it's kind of frustrating, because if there were fewer variables it would be very fast!
If there is a solution, I think it could be more than a fix for this particular situation; it would probably also be a best practice for working with large data.
MK
Editor's note: this is a popular topic, so we're summarizing here: use PROC MEANS, ask for MAX, and name the one variable in the VAR statement. Even if the table is in a non-SAS database, the summary work should get pushed to the database for fast processing.
Proc Means Data=MixTypeData Max;
Var x1;
Run;
--- Details ---
So that everyone has the same data set to experiment on, here is a data set that has 10 million records, 86 variables, and 24 character variables of length 30 characters. On my CPU, it takes 30 seconds to create the data and 4 seconds to compute the max.
@majdi_ka Please run this program. If it is taking "more than 10 MINUTES" to run, then I favor @Kurt_Bremser's theory that it is hardware related. I am not a hardware person, but others on the list might help you investigate hardware/network issues such as:
- Is your drive fragmented, congested, or slow?
- Is the data set on a mounted or networked drive? (Network access is slow.)
- Is the data stored in a non-native format, so that SAS is reading the data through CEDA? Also slow.
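One way to check that last point is PROC CONTENTS: its output includes "Data Representation" and "Encoding" fields, and if the data representation does not match your host, SAS translates the file on the fly through CEDA. A minimal check, using the test data set created below:

```sas
/* If "Data Representation" in the output differs from your host
   (e.g. WINDOWS_64 on a Linux server), SAS is reading via CEDA */
proc contents data=MixTypeData;
run;
```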
%let nCont = 62; /* number of continuous (numeric) vars */
%let nChar = 24; /* number of character vars */
%let N = 1e7;    /* number of observations */
data MixTypeData;
drop i j;
array x{&nCont} x1-x&nCont;
array c{&nChar} $30 c1-c&nChar; /* $30 sets length 30; no period in an ARRAY statement */
call streaminit(1);
do i = 1 to &N;
   do j = 1 to &nCont;
      x{j} = rand("Uniform");
   end;
   do j = 1 to &nChar;
      c{j} = "ABCDEFGHIJKLMNOP";
   end;
   output;
end;
run;
Proc Means Data=MixTypeData Max;
Var x1;
Run;
Rick, when you do that, you benefit from the system's disk cache. If you run PROC MEANS immediately after creating the data set, most (if not all) of the file will still be in main memory.
I'm currently running a test on my (seriously old) server to get some values with different methods & settings.
(SAS 9.2, AIX 5.3 on p520, 2GB fibre channel SAN, Internal Ultra SCSI LVD disks)
Test 1:
File created on SAN storage, program unchanged
File size ~12 GB
Creation: 4 min 24 sec
Means: 1 min 35 sec
Test 2:
COMPRESS option set, SAN storage
File size ~9.5 GB
Had to stop tests at this point, because some of my power users decided to run some real heavy-duty jobs and brought the server under severe load. Will update once I have the server to myself.
I ran the program, and it took 50 seconds to create the table and 6 seconds to calculate the max.
As I mentioned in a previous comment, I have 200 million rows, not 10 million.
I tried on a subset of 50 million and it took 2 minutes 30 seconds.
I then tried the COMPRESS option when creating a compressed version of the 50-million-row table. It took 1 minute 20 seconds, which is a good improvement.
So the solution is mainly related to the size of the data. I load the SAS data from an Oracle database, so every numeric variable has a length of 8.
I will think about taking fewer variables from the original table and defining a smaller length for numeric variables (3 for flags, for example).
Then I will add COMPRESS=YES (or BINARY or CHAR; I must reread some papers about this), and then sort and create the necessary indexes.
I think this will save disk space and execution time.
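A minimal sketch of that plan (fewer variables, shorter numeric lengths, compression, then a sort); the library, table, and variable names here are hypothetical placeholders, not the actual Oracle table:

```sas
/* Keep only the needed variables, shrink flag lengths, and compress.
   oralib.bigtable, id, flag1, amount are hypothetical names. */
data work.small (compress=yes);
   length flag1 3;   /* 3 bytes is the minimum numeric length on Windows/Unix */
   set oralib.bigtable (keep=id flag1 amount);
run;

proc sort data=work.small;
   by id;
run;
```

Note that shortening a numeric length below 8 loses precision for large integers, so it is safe for flags and small counts but not for identifiers with many digits.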
Thank you for your help and effort! Really useful advice 🙂
MK
Could you run the following IML code on your computer? I hope your memory won't blow up.
%let dsn = sashelp.class;
%let var = weight;
proc iml;
use &dsn;            /* open the data set for reading */
read all var {&var}; /* read the whole column into a vector */
close;
print (&var[<>]);    /* <> is the max subscript reduction operator */
quit;
Sorry, but it didn't work; I found that I don't have the license for SAS/IML (I've never worked with IML).
On the other hand, if this is a frequent request, perhaps required by many users, I think you need to step up your game:
- Are you on the right platform? Sizing may be required.
- Are you using the right data engine? SPDE lets you read in parallel. If you wish to optimize aggregate functions, investing in SPD Server could be an option, since it supports query aggregation on indexes (hence no table scan). Or an external RDBMS (perhaps a columnar one, based on the query in your post). Other options that come to mind are in-memory databases (LASR with SAS Visual Analytics).
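To illustrate the SPDE suggestion, here is a minimal sketch of a LIBNAME that spreads a data set's partitions across several disks so reads can run in parallel; all paths and the library name are hypothetical:

```sas
/* Hypothetical layout: metadata in one location, data partitions
   spread over two physical disks for parallel I/O */
libname speedy spde '/sasdata/meta'
   datapath=('/disk1/sas' '/disk2/sas')
   partsize=512m;

/* Copy the big table into the SPDE library, then summarize */
data speedy.mixtypedata;
   set work.mixtypedata;
run;

proc means data=speedy.mixtypedata max;
   var x1;
run;
```

The parallelism only pays off when the DATAPATH= locations sit on genuinely separate physical devices or channels; pointing them at one disk just adds overhead.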