1) 86 variables
2) 24 character variables
3) only one variable.
Yes, I totally understand this, and I think that's why it's kind of frustrating, because if there were fewer variables it would be very fast!
If there is a solution, I think it could be more than a fix for this particular situation; it would probably also be a best practice for working with large data.
MK
Editor's note: this is a popular topic, so we're summarizing here: use PROC MEANS, ask for MAX, and name the one variable in the VAR statement. Even if the table is in a non-SAS database, the summary work should get pushed to the database for fast processing.
Proc Means Data=MixTypeData Max;
Var x1;
Run;
--- Details ---
So that everyone has the same data set to experiment on, here is a data set that has 10 million records, 86 variables, and 24 character variables of length 30 characters. On my CPU, it takes 30 seconds to create the data and 4 seconds to compute the max.
@majdi_ka Please run this program. If it is taking "more than 10 MINUTES" to run, then I favor @Kurt_Bremser's theory that it is hardware related. I am not a hardware person, but others on the list might help you investigate hardware/network issues such as:
- Is your drive fragmented, congested, or slow?
- Is the data set on a mounted or networked drive? (Network access is slow.)
- Is the data stored in a non-native format, so that SAS is reading the data through CEDA? Also slow.
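One way to check that last point is PROC CONTENTS: its output includes "Data Representation" and "Encoding" fields, and if the data representation does not match your host, SAS translates the file on the fly through CEDA. A minimal check, using the test data set created below:

```sas
/* If "Data Representation" in the output differs from your host
   (e.g. WINDOWS_64 on a Linux server), SAS is reading via CEDA */
proc contents data=MixTypeData;
run;
```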
%let nCont = 62; /* number of continuous (numeric) vars */
%let nChar = 24; /* number of character vars */
%let N = 1e7;    /* number of observations */
data MixTypeData;
drop i j;
array x{&nCont} x1-x&nCont;
array c{&nChar} $30 c1-c&nChar; /* $30 sets length 30; no period in an ARRAY statement */
call streaminit(1);
do i = 1 to &N;
   do j = 1 to &nCont;
      x{j} = rand("Uniform");
   end;
   do j = 1 to &nChar;
      c{j} = "ABCDEFGHIJKLMNOP";
   end;
   output;
end;
run;
Proc Means Data=MixTypeData Max;
Var x1;
Run;
Rick, when you do that, you benefit from the system's disk cache. If you run PROC MEANS immediately after creating the data set, most (if not all) of the file will still be in main memory.
I'm currently running a test on my (seriously old) server to get some values with different methods & settings.
(SAS 9.2, AIX 5.3 on p520, 2GB fibre channel SAN, Internal Ultra SCSI LVD disks)
Test 1:
File created on SAN storage, program unchanged
File size ~12 GB
Creation: 4 min 24 sec
Means: 1 min 35 sec
Test 2:
COMPRESS option set, SAN storage
File size ~9.5 GB
Had to stop tests at this point, because some of my power users decided to run some real heavy-duty jobs and brought the server under severe load. Will update once I have the server to myself.
I ran the program, and it took 50 seconds to create the table and 6 seconds to calculate the max.
As I mentioned in a previous comment, I have 200 million rows, not 10 million.
I tried on a subset of 50 million and it took 2 minutes 30 seconds.
I then tried the COMPRESS option when creating a compressed version of the 50-million-row table. It took 1 minute 20 seconds, which is a good improvement.
So the solution is mainly related to the size of the data. I load the SAS data from an Oracle database, so every numeric variable has a length of 8.
I will think about taking fewer variables from the original table and defining a smaller length for numeric variables (3 for flags, for example).
Then I will add COMPRESS=YES (or BINARY or CHAR; I must reread some papers about this), and then sort and create the necessary indexes.
I think this will save disk space and execution time.
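A minimal sketch of that plan (fewer variables, shorter numeric lengths, compression, then a sort); the library, table, and variable names here are hypothetical placeholders, not the actual Oracle table:

```sas
/* Keep only the needed variables, shrink flag lengths, and compress.
   oralib.bigtable, id, flag1, amount are hypothetical names. */
data work.small (compress=yes);
   length flag1 3;   /* 3 bytes is the minimum numeric length on Windows/Unix */
   set oralib.bigtable (keep=id flag1 amount);
run;

proc sort data=work.small;
   by id;
run;
```

Note that shortening a numeric length below 8 loses precision for large integers, so it is safe for flags and small counts but not for identifiers with many digits.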
Thank you for your help and effort! Really useful advice 🙂
MK
Could you run the following IML code on your computer? I hope your memory won't blow up.
%let dsn = sashelp.class;
%let var = weight;
proc iml;
use &dsn;            /* open the data set for reading */
read all var {&var}; /* read the whole column into a vector */
close;
print (&var[<>]);    /* <> is the max subscript reduction operator */
quit;
Sorry, but it didn't work; I found that I don't have the license for SAS/IML (I've never worked with IML).
On the other hand, if this is a frequent request, perhaps required by many users, I think you need to step up your game:
- Are you on the right platform? Sizing may be required.
- Are you using the right data engine? SPDE lets you read in parallel. If you wish to optimize aggregate functions, investing in SPD Server could be an option, since it supports query aggregation on indexes (hence no table scan). Or an external RDBMS (perhaps a columnar one, based on the query in your post). Other options that come to mind are in-memory databases (LASR with SAS Visual Analytics).
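To illustrate the SPDE suggestion, here is a minimal sketch of a LIBNAME that spreads a data set's partitions across several disks so reads can run in parallel; all paths and the library name are hypothetical:

```sas
/* Hypothetical layout: metadata in one location, data partitions
   spread over two physical disks for parallel I/O */
libname speedy spde '/sasdata/meta'
   datapath=('/disk1/sas' '/disk2/sas')
   partsize=512m;

/* Copy the big table into the SPDE library, then summarize */
data speedy.mixtypedata;
   set work.mixtypedata;
run;

proc means data=speedy.mixtypedata max;
   var x1;
run;
```

The parallelism only pays off when the DATAPATH= locations sit on genuinely separate physical devices or channels; pointing them at one disk just adds overhead.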