Hi,
I am trying to understand how compress works with tables and views and what is the most optimized way of using it.
We have some tables that are extremely big. I can only see a view of that table and it takes usually about 15min to query from that view. I would like to find if there is a better way of using it to get results faster.
I am particularly interested in compress vs non compress and views vs table/datasets.
I know about other optimization methods like using DROP, KEEP, WHERE clause, defining variables to appropriate length, etc.
My understanding of SAS storage is that if variable is defined as 4000 characters and it only contains 4 character it will still use 4000 bytes on disk if it is not compressed or actual length of 4 bytes if it uses compress.
To find out I have done some test and I don't exactly understand why am I getting such results.
*Dataset 1, compressed;
data x_c(compress=yes);
length a $ 4000;
do i=1 to 1000000;
a='abcd';
output;
end;
run;
*Dataset 2, not compressed;
data x_c(compress=yes);
length a $ 4000;
do i=1 to 1000000;
a='abcd';
output;
end;
run;
I ran those queries and got the following timings:
Non compressed:
real time 5.68 seconds
user cpu time 1.08 seconds
system cpu time 2.93 seconds
Compressed:
real time 3.95 seconds
user cpu time 3.90 seconds
system cpu time 0.03 seconds
So in this case Non compressed took longer which makes sense.
Then I created views for both tables:
PROC SQL; CREATE VIEW tmpv AS SELECT * FROM x_noc;
PROC SQL; CREATE VIEW tmpvc AS SELECT * FROM x_c;
And then I queried few rows from each of datasets and views to see the speed of each of them.
PROC SQL; CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001;
PROC SQL; CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001;
PROC SQL; CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001;
PROC SQL; CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001;
And the timings were the following:
TMP1:
real time 3.01 seconds
user cpu time 0.20 seconds
system cpu time 1.70 seconds
TMP2:
real time 0.24 seconds
user cpu time 0.21 seconds
system cpu time 0.01 seconds
TMP3:
real time 1.68 seconds
user cpu time 0.17 seconds
system cpu time 1.37 seconds
TMP4:
real time 0.25 seconds
user cpu time 0.21 seconds
system cpu time 0.02 seconds
My question here is why is querying from view from non-compressed dataset faster than querying from non-compressed dataset itself?
Is there some sort of implicit compress in the view?
Is there a way I can improve speed for querying from my original view that takes 15mins? Initially I thought that querying from view that says data is not compressed (proc contents) is making things slow, but after doing my own test it doesn't look like it is any slower. What am I not understanding correctly here?
Note: I cannot create indexes as I do not have admin rights.
Using SAS EG 9.4 on windows but unix server.
Thank you in advance for your replies.
I ran your code on our AIX LPAR (just 2 POWER8 cores, so consider it "very small"):
data x_noc;
length a $ 4000;
do i=1 to 1000000;
a='abcd';
output;
end;
run;
data x_c(compress=yes);
length a $ 4000;
do i=1 to 1000000;
a='abcd';
output;
end;
run;
PROC SQL; CREATE VIEW tmpv AS SELECT * FROM x_noc; quit;
PROC SQL; CREATE VIEW tmpvc AS SELECT * FROM x_c; quit;
PROC SQL; CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001; quit;
PROC SQL; CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001; quit;
PROC SQL; CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001; quit;
PROC SQL; CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001; quit;
Note that I added quit statements, so that the NOTEs follow the code immediately (important for the last step, or otherwise the NOTE is triggered by the "magic line" that EG sends, and only appears after other code).
The log:
37 data x_noc; 38 length a $ 4000; 39 do i=1 to 1000000; 40 a='abcd'; 41 output; 42 end; 43 run; NOTE: The data set WORK.X_NOC has 1000000 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 40.54 seconds cpu time 5.06 seconds 44 45 data x_c(compress=yes); 46 length a $ 4000; 47 do i=1 to 1000000; 48 a='abcd'; 49 output; 50 end; 51 run; NOTE: The data set WORK.X_C has 1000000 observations and 2 variables. NOTE: Compressing data set WORK.X_C decreased size by 98.94 percent. Compressed is 330 pages; un-compressed would require 31250 pages. NOTE: DATA statement used (Total process time): real time 15.05 seconds cpu time 2.06 seconds 52 53 PROC SQL; 53 ! CREATE VIEW tmpv AS SELECT * FROM x_noc; NOTE: SQL view WORK.TMPV has been defined. 53 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 54 PROC SQL; 54 ! CREATE VIEW tmpvc AS SELECT * FROM x_c; NOTE: SQL view WORK.TMPVC has been defined. 54 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 0.02 seconds cpu time 0.00 seconds 55 56 PROC SQL; 56 ! CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001; NOTE: Table WORK.TMP1 created, with 999 rows and 2 columns. 56 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 45.99 seconds cpu time 2.51 seconds 57 PROC SQL; 57 ! CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001; NOTE: Table WORK.TMP2 created, with 999 rows and 2 columns. 57 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 1.52 seconds cpu time 0.17 seconds 58 PROC SQL; 58 ! CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001; NOTE: Table WORK.TMP3 created, with 999 rows and 2 columns. 58 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 53.58 seconds cpu time 2.27 seconds 59 PROC SQL; 59 ! CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001; NOTE: Table WORK.TMP4 created, with 999 rows and 2 columns. 59 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 1.78 seconds cpu time 0.18 seconds
We had a moderate load on the server during the test, with a run queue of ~10 throughout.
As you can see, the difference between compressed and uncompressed is quite substantial, and stays that way when using the views.
You may have had external influences (other jobs). Either make sure you have no other process when running tests, or run the tests repeatedly to even that out.
I understand a SQL view as nothing else than encapsulated SQL code which gets executed in the moment I'm using the view.
For this reason you certainly can't have a compressed view as there is nothing to compress. It's always the underlying physical data.
As for the compression: That's on disk and it saves disk space and disk and eventually network i/o which is normally the slowest bit of your process. The variable gets still fully expanded/uncompressed when loaded into memory/PDV (and this expanding actually adds CPU overhead).
The only explanation I'm having why in your test using the view was that much quicker than using the physical table is that the data must still have been available in the disk cash and for this reason i/o was much quicker.
Probably use a copy of the physical table for the view in order to compare apples with apples.
And last but not least: Real times can also vary if there are other competing processes running on your machine.
I ran your code on our AIX LPAR (just 2 POWER8 cores, so consider it "very small"):
data x_noc;
length a $ 4000;
do i=1 to 1000000;
a='abcd';
output;
end;
run;
data x_c(compress=yes);
length a $ 4000;
do i=1 to 1000000;
a='abcd';
output;
end;
run;
PROC SQL; CREATE VIEW tmpv AS SELECT * FROM x_noc; quit;
PROC SQL; CREATE VIEW tmpvc AS SELECT * FROM x_c; quit;
PROC SQL; CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001; quit;
PROC SQL; CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001; quit;
PROC SQL; CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001; quit;
PROC SQL; CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001; quit;
Note that I added quit statements, so that the NOTEs follow the code immediately (important for the last step, or otherwise the NOTE is triggered by the "magic line" that EG sends, and only appears after other code).
The log:
37 data x_noc; 38 length a $ 4000; 39 do i=1 to 1000000; 40 a='abcd'; 41 output; 42 end; 43 run; NOTE: The data set WORK.X_NOC has 1000000 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 40.54 seconds cpu time 5.06 seconds 44 45 data x_c(compress=yes); 46 length a $ 4000; 47 do i=1 to 1000000; 48 a='abcd'; 49 output; 50 end; 51 run; NOTE: The data set WORK.X_C has 1000000 observations and 2 variables. NOTE: Compressing data set WORK.X_C decreased size by 98.94 percent. Compressed is 330 pages; un-compressed would require 31250 pages. NOTE: DATA statement used (Total process time): real time 15.05 seconds cpu time 2.06 seconds 52 53 PROC SQL; 53 ! CREATE VIEW tmpv AS SELECT * FROM x_noc; NOTE: SQL view WORK.TMPV has been defined. 53 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 54 PROC SQL; 54 ! CREATE VIEW tmpvc AS SELECT * FROM x_c; NOTE: SQL view WORK.TMPVC has been defined. 54 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 0.02 seconds cpu time 0.00 seconds 55 56 PROC SQL; 56 ! CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001; NOTE: Table WORK.TMP1 created, with 999 rows and 2 columns. 56 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 45.99 seconds cpu time 2.51 seconds 57 PROC SQL; 57 ! CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001; NOTE: Table WORK.TMP2 created, with 999 rows and 2 columns. 57 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 1.52 seconds cpu time 0.17 seconds 58 PROC SQL; 58 ! CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001; NOTE: Table WORK.TMP3 created, with 999 rows and 2 columns. 58 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 53.58 seconds cpu time 2.27 seconds 59 PROC SQL; 59 ! CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001; NOTE: Table WORK.TMP4 created, with 999 rows and 2 columns. 59 ! quit; NOTE: PROZEDUR SQL used (Total process time): real time 1.78 seconds cpu time 0.18 seconds
We had a moderate load on the server during the test, with a run queue of ~10 throughout.
As you can see, the difference between compressed and uncompressed is quite substantial, and stays that way when using the views.
You may have had external influences (other jobs). Either make sure you have no other process when running tests, or run the tests repeatedly to even that out.
I have ran all the examples again and this time I got the same results as you. Querying from view with compressed underlying dataset gave me the same timing as querying directly from compressed dataset.
Not sure why I got different timings yesterday because I ran it multiple times and i always got the result same result as I mentioned in my first post.
Thank you.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.