BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
plevcek
Fluorite | Level 6

Hi,

I am trying to understand how compress works with tables and views and what is the most optimized way of using it.

We have some tables that are extremely big. I can only see a view of that table and it takes usually about 15min to query from that view. I would like to find if there is a better way of using it to get results faster.

I am particularly interested in compress vs non compress and views vs table/datasets. 

I know about other optimization methods like using DROP, KEEP, WHERE clause, defining variables to appropriate length, etc.

 

My understanding of SAS storage is that if variable is defined as 4000 characters and it only contains 4 character it will still use 4000 bytes on disk if it is not compressed or actual length of 4 bytes if it uses compress.

To find out I have done some test and I don't exactly understand why am I getting such results.

 

 

*Dataset 1, compressed;
data x_c(compress=yes);
   length a $ 4000;
   do i=1 to 1000000;
      a='abcd';
      output;
   end;
run;

*Dataset 2, not compressed;
data x_c(compress=yes);
   length a $ 4000;
   do i=1 to 1000000;
      a='abcd';
      output;
   end;
run;

 

I ran those queries and got the following timings:

Non compressed:

real time 5.68 seconds
user cpu time 1.08 seconds
system cpu time 2.93 seconds

Compressed:

real time 3.95 seconds
user cpu time 3.90 seconds
system cpu time 0.03 seconds

 

So in this case Non compressed took longer which makes sense.

 

Then I created views for both tables:

 

PROC SQL; CREATE VIEW tmpv AS SELECT * FROM x_noc;
PROC SQL; CREATE VIEW tmpvc AS SELECT * FROM x_c;

And then I queried few rows from each of datasets and views to see the speed of each of them.

PROC SQL; CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001;
PROC SQL; CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001;
PROC SQL; CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001;
PROC SQL; CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001;

And the timings were the following:

TMP1: 

real time 3.01 seconds
user cpu time 0.20 seconds
system cpu time 1.70 seconds

 

TMP2:

real time 0.24 seconds
user cpu time 0.21 seconds
system cpu time 0.01 seconds

 

TMP3:

real time 1.68 seconds
user cpu time 0.17 seconds
system cpu time 1.37 seconds

 

TMP4:

real time 0.25 seconds
user cpu time 0.21 seconds
system cpu time 0.02 seconds

 

My question here is why is querying from view from non-compressed dataset faster than querying from non-compressed dataset itself?

Is there some sort of implicit compress in the view?

Is there a way I can improve speed for querying from my original view that takes 15mins? Initially I thought that querying from view that says data is not compressed (proc contents) is making things slow, but after doing my own test it doesn't look like it is any slower. What am I not understanding correctly here?

Note: I cannot create indexes as I do not have admin rights.

 

Using SAS EG 9.4 on windows but unix server.

 

Thank you in advance for your replies.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I ran your code on our AIX LPAR (just 2 POWER8 cores, so consider it "very small"):

data x_noc;
   length a $ 4000;
   do i=1 to 1000000;
      a='abcd';
      output;
   end;
run;

data x_c(compress=yes);
   length a $ 4000;
   do i=1 to 1000000;
      a='abcd';
      output;
   end;
run;

PROC SQL; CREATE VIEW tmpv AS SELECT * FROM x_noc; quit;
PROC SQL; CREATE VIEW tmpvc AS SELECT * FROM x_c; quit;

PROC SQL; CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001; quit;
PROC SQL; CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001; quit;
PROC SQL; CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001; quit;
PROC SQL; CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001; quit;

Note that I added quit statements, so that the NOTEs follow the code immediately (important for the last step, or otherwise the NOTE is triggered by the "magic line" that EG sends, and only appears after other code).

The log:

37         data x_noc;
38            length a $ 4000;
39            do i=1 to 1000000;
40               a='abcd';
41               output;
42            end;
43         run;

NOTE: The data set WORK.X_NOC has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           40.54 seconds
      cpu time            5.06 seconds
      

44         
45         data x_c(compress=yes);
46            length a $ 4000;
47            do i=1 to 1000000;
48               a='abcd';
49               output;
50            end;
51         run;

NOTE: The data set WORK.X_C has 1000000 observations and 2 variables.
NOTE: Compressing data set WORK.X_C decreased size by 98.94 percent. 
      Compressed is 330 pages; un-compressed would require 31250 pages.
NOTE: DATA statement used (Total process time):
      real time           15.05 seconds
      cpu time            2.06 seconds
      

52         
53         PROC SQL;
53       !           CREATE VIEW tmpv AS SELECT * FROM x_noc;
NOTE: SQL view WORK.TMPV has been defined.
53       !                                                    quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
      

54         PROC SQL;
54       !           CREATE VIEW tmpvc AS SELECT * FROM x_c;
NOTE: SQL view WORK.TMPVC has been defined.
54       !                                                   quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

55         
56         PROC SQL;
56       !           CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001;
NOTE: Table WORK.TMP1 created, with 999 rows and 2 columns.

56       !                                                                      quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           45.99 seconds
      cpu time            2.51 seconds
      

57         PROC SQL;
57       !           CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001;
NOTE: Table WORK.TMP2 created, with 999 rows and 2 columns.

57       !                                                                    quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           1.52 seconds
      cpu time            0.17 seconds
      

58         PROC SQL;
58       !           CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001;
NOTE: Table WORK.TMP3 created, with 999 rows and 2 columns.

58       !                                                                     quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           53.58 seconds
      cpu time            2.27 seconds
      

59         PROC SQL;
59       !           CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001;
NOTE: Table WORK.TMP4 created, with 999 rows and 2 columns.

59       !                                                                      quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           1.78 seconds
      cpu time            0.18 seconds
      

We had a moderate load on the server during the test, with a run queue of ~10 throughout.

As you can see, the difference between compressed and uncompressed is quite substantial, and stays that way when using the views.

You may have had external influences (other jobs). Either make sure you have no other process when running tests, or run the tests repeatedly to even that out.

View solution in original post

4 REPLIES 4
Patrick
Opal | Level 21

I understand a SQL view as nothing else than encapsulated SQL code which gets executed in the moment I'm using the view. 

For this reason you certainly can't have a compressed view as there is nothing to compress. It's always the underlying physical data.

 

As for the compression: That's on disk and it saves disk space and disk and eventually network i/o which is normally the slowest bit of your process. The variable gets still fully expanded/uncompressed when loaded into memory/PDV (and this expanding actually adds CPU overhead).

 

The only explanation I'm having why in your test using the view was that much quicker than using the physical table is that the data must still have been available in the disk cash and for this reason i/o was much quicker. 

Probably use a copy of the physical table for the view in order to compare apples with apples.

 

And last but not least: Real times can also vary if there are other competing processes running on your machine.

Astounding
PROC Star
For more reliable time estimates, add a zero:

1 to 10000000

Also note, compression is not an inherited feature. In all your tests, the output is uncompressed. That doesn't explain your results, but should be noted.
Kurt_Bremser
Super User

I ran your code on our AIX LPAR (just 2 POWER8 cores, so consider it "very small"):

data x_noc;
   length a $ 4000;
   do i=1 to 1000000;
      a='abcd';
      output;
   end;
run;

data x_c(compress=yes);
   length a $ 4000;
   do i=1 to 1000000;
      a='abcd';
      output;
   end;
run;

PROC SQL; CREATE VIEW tmpv AS SELECT * FROM x_noc; quit;
PROC SQL; CREATE VIEW tmpvc AS SELECT * FROM x_c; quit;

PROC SQL; CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001; quit;
PROC SQL; CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001; quit;
PROC SQL; CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001; quit;
PROC SQL; CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001; quit;

Note that I added quit statements, so that the NOTEs follow the code immediately (important for the last step, or otherwise the NOTE is triggered by the "magic line" that EG sends, and only appears after other code).

The log:

37         data x_noc;
38            length a $ 4000;
39            do i=1 to 1000000;
40               a='abcd';
41               output;
42            end;
43         run;

NOTE: The data set WORK.X_NOC has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           40.54 seconds
      cpu time            5.06 seconds
      

44         
45         data x_c(compress=yes);
46            length a $ 4000;
47            do i=1 to 1000000;
48               a='abcd';
49               output;
50            end;
51         run;

NOTE: The data set WORK.X_C has 1000000 observations and 2 variables.
NOTE: Compressing data set WORK.X_C decreased size by 98.94 percent. 
      Compressed is 330 pages; un-compressed would require 31250 pages.
NOTE: DATA statement used (Total process time):
      real time           15.05 seconds
      cpu time            2.06 seconds
      

52         
53         PROC SQL;
53       !           CREATE VIEW tmpv AS SELECT * FROM x_noc;
NOTE: SQL view WORK.TMPV has been defined.
53       !                                                    quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
      

54         PROC SQL;
54       !           CREATE VIEW tmpvc AS SELECT * FROM x_c;
NOTE: SQL view WORK.TMPVC has been defined.
54       !                                                   quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

55         
56         PROC SQL;
56       !           CREATE TABLE tmp1 AS SELECT * FROM x_noc where i > 999001;
NOTE: Table WORK.TMP1 created, with 999 rows and 2 columns.

56       !                                                                      quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           45.99 seconds
      cpu time            2.51 seconds
      

57         PROC SQL;
57       !           CREATE TABLE tmp2 AS SELECT * FROM x_c where i > 999001;
NOTE: Table WORK.TMP2 created, with 999 rows and 2 columns.

57       !                                                                    quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           1.52 seconds
      cpu time            0.17 seconds
      

58         PROC SQL;
58       !           CREATE TABLE tmp3 AS SELECT * FROM tmpv where i > 999001;
NOTE: Table WORK.TMP3 created, with 999 rows and 2 columns.

58       !                                                                     quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           53.58 seconds
      cpu time            2.27 seconds
      

59         PROC SQL;
59       !           CREATE TABLE tmp4 AS SELECT * FROM tmpvc where i > 999001;
NOTE: Table WORK.TMP4 created, with 999 rows and 2 columns.

59       !                                                                      quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           1.78 seconds
      cpu time            0.18 seconds
      

We had a moderate load on the server during the test, with a run queue of ~10 throughout.

As you can see, the difference between compressed and uncompressed is quite substantial, and stays that way when using the views.

You may have had external influences (other jobs). Either make sure you have no other process when running tests, or run the tests repeatedly to even that out.

plevcek
Fluorite | Level 6

I have ran all the examples again and this time I got the same results as you. Querying from view with compressed underlying dataset gave me the same timing as querying directly from compressed dataset.

Not sure why I got different timings yesterday because I ran it multiple times and i always got the result same result as I mentioned in my first post.

 

Thank you.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1443 views
  • 1 like
  • 4 in conversation