BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
majdi_ka
Obsidian | Level 7

@Rick_SAS,

 

1) 86 variables

2) 24 character variables

3) only one variable.

 

Yes, I totally understand this, and I think that's why it's kind of frustrating, Because if there was less variables, it would be very fast !

If there's a solution I think it can be more than a solution for this particular situation, but also probably a best practice when working with large data. 

 

MK

 

Rick_SAS
SAS Super FREQ

Editor's note: this is a popular topic so we're summarizing here - use PROC MEANS, asking for MAX and name the one variable in the VAR statement.  Even if the table is in a non-SAS database, the summary work should get pushed the database for fast processing.

 

Proc Means Data=MixTypeData Max;
  Var x1;
Run;

--- Details ---

So that everyone has the same data set to experiment on, here is a data set that has 10 million records, 86 variables, and 24 character variables of length 30  characters.  On my CPU, it takes 30 seconds to create the data and 4 SECONDS to compute the mean. 

 

@majdi_ka Please run this program.  If it is taking "more than 10 MINUTES" to run, then I favor @Kurt_Bremser's theory that it is hardware related.  I am not a hardware person, but others on the list might help you to investigate hardware/network issues like

  1. Is your drive is fragmented/congested/slow/etc.
  2. Is the data set on a mounted or networked drive? (Network access is slow.)
  3. Is the data stored in a non-native format so that SAS is reading the data by using CEDA? Also slow.

 

%let nCont = 62;             /* number of contin vars               */
%let nChar = 24;             /* number of character vars         */
%let N = 1e7;
data MixTypeData;   
drop i j;
array x{&nCont} x1-x&nCont;  
array c{&nChar} $30. c1-c&nChar;  
call streaminit(1);
do i = 1 to &N; 
   do j = 1 to &nCont;              
      x{j} = rand("Uniform");  
   end; 
   do j = 1 to &nChar;
      c{j} = "ABCDEFGHIJKLMNOP";
   end; 
   output;  
end;  
run;

Proc Means Data=MixTypeData Max;
  Var x1;
Run;

 

Kurt_Bremser
Super User

Rick, when you do that, you profit from the system's disk cache. If you do PROC MEANS immediately after creating the data set, most (if not all) of the file will still be in main memory.

 

I'm currently running a test on my (seriously old) server to get some values with different methods & settings.

 

(SAS 9.2, AIX 5.3 on p520, 2GB fibre channel SAN, Internal Ultra SCSI LVD disks)

 

Test1:

File created on SAN storage, program unchanged

File size ~12 GB

Creation: 4min 24 sec

Means: 1min 35sec

Test2:

compress option set, SAN storage

File size ~9.5 GB

 

Had to stop tests at this point, because some of my powerusers decided to run some real heavy-duty jobs and brought the server under severe load. WIll update once I have the server for myself.

majdi_ka
Obsidian | Level 7

I ran the program and It took 50 seconds to create the table and 6 seconds to calculate the max.

 

As I mentioned in a previous comment, I have 200 million lines and not 10 millions.

I tried on a subset of 50M and it took 2 minutes 30 seconds.

 

I then tried the compress option when creating a compressed version of the 50 M table. It took 1:20 which is a good improvement.

 

So the solution is mainly related with the size of the data. I store the SAS data from an oracle database. so every numeric data has a length of 8. 

 

I will think of taking less variables from the original table, and define a smaller length for numeric variables (3 for flags for example).

then I will add a compress=yes (or binary or char, I must reread some papers about this). and then sort and create necessary indexes.

 

 

I think this will save disk space and execution time.

 

Thank you for your help and effort! Really useful advices 🙂

 

MK

Ksharp
Super User

Could you run the following IML code on your computer . Hope your memory wouldn't blow up .

 

%let dsn=sashelp.class;
%let var=weight;

proc iml ;
use &dsn ;
read all var {&var};
close;
print (&var[<>]);
quit;
majdi_ka
Obsidian | Level 7

Sorry but it didn't work, I found that I don't have the licence for IML...(I've never worked with IML)

LinusH
Tourmaline | Level 20
Let us step back and think of why do we optimize. If this is not a very frequent query and/or it does not hinder you to the analysis you wish, it's "just" annoying and something you could live with.

On the other hand, if this is a frequent request, perhaps required by many users, I think that you need to step up the game.
-are you on the right platform, sizing may be required
-are you using the right data engine? SPDE let you read in parallel. If you wish to optimise aggregate functions investing in SPD Server could be an option, since it support query aggregation on indexes (hence - no table scan). Or an extrnal RDBMS (perhaps a columnar one base on the query of your post). Other options that comes in mind is in memory databases (LASR with Visual Analytics ).
Data never sleeps

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 24085 views
  • 10 likes
  • 6 in conversation