agoldma
Pyrite | Level 9

I'm merging 2 datasets

  1. count = 6,437,567 variables =  22   file size = 1.2 GB   page size =  65,536   # pages = 18,566
  2. count = 2,276,587 variables = 917  file size = 2.4 GB   page size = 131,072  # pages = 19,726

The resulting dataset has:

     count = 6,437,567   variables = 924  file size = 21.8 GB  page size = 131,072  # pages = 170,379

 

I didn't expect to get 22 GB when merging 1 GB with 2 GB

 

This is a straight merge without any calculations (dropping/merging a few fields)

I get the same results with a data step merge as with SQL (left join)

None of the character fields are excessively long (the longest is 40 and only a few are that long)

All 3 datasets use CHAR compression;  no indexes yet

Dataset option REUSE = YES reduced the output from 23 GB to 22 GB

I'm using SAS-EG 9.0401M5 on Linux

Does it really make sense for the output dataset to be 22 GB?
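(The figures above are the kind PROC CONTENTS reports; a minimal sketch, using the dataset names that appear later in this thread:)

proc contents data=project.ds1;   /* observations, variables, page size, number of pages, compression */
run;
proc contents data=project.ds2;
run;
proc contents data=project.ds3;
run;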

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19


View solution in original post

12 REPLIES
PeterClemmensen
Tourmaline | Level 20

Can you post your code please?

agoldma
Pyrite | Level 9

This is the data step version:

 

 

data project.ds3 (reuse=yes);
   merge project.ds1 (in=inX)
         project.ds2
         ;
   by field1
      field2
      ;
   if inX;   * keep only observations present in ds1 -- optional ;
run;

 

Compression is applied automatically based on the setup of my SAS-EG
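I believe this corresponds to having the system options set, roughly:

options compress=char reuse=yes;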

PeterClemmensen
Tourmaline | Level 20

Are you sure that the data set is compressed? What happens if you add the COMPRESS=CHAR option like this?

 

data project.ds3 (reuse=yes compress=char);
   merge project.ds1 (in=inX)
         project.ds2
         ;
   by field1
      field2
      ;
   if inX;   * optional ;
run;
agoldma
Pyrite | Level 9

I know that the dataset ds3 is compressed because PROC CONTENTS says so and the log says that compression reduced the size by 54%

I still tried adding COMPRESS=CHAR (at the top) but it didn't change anything

I tried a few other things:

  1. Data step copy -- produces the same size
  2. PROC COPY -- produces the same size
  3. PROC CPORT -- produces a smaller but still large transport file of 14 GB
     ... PROC CIMPORT -- produces the same size (22 GB)   (see the sketch at the end of this post)
  4. COMPRESS=BINARY -- reduced the size to 10 GB (still very large, coming from 1 + 2 GB)
  5. PROC EXPORT to CSV = 9 GB (record length between 1045 and 19602)

Maybe 10 GB is the right answer (just hard to believe that 1 + 2 = 10)

Maybe this is why we made relational databases 50 years ago (and I should keep these 2 datasets separate)
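The transport round trip in item 3 looked roughly like this (the file path here is just a placeholder):

proc cport data=project.ds3 file='/tmp/ds3.cpt';
run;

proc cimport infile='/tmp/ds3.cpt' library=work;
run;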

Tom
Super User

10 GB sounds right. You are adding over 900 variables to your 6 million observations.

 

2.4 * 3 = 7.2;  7.2 + 2.1 = 9.3
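Rough reasoning behind those numbers (the factor 3 and the 2.1 GB are ballpark assumptions, not measurements): ds2's 2.4 GB gets spread across roughly three times as many output rows, plus a couple of GB for the ds1 columns.

data _null_;
   fanout   = 6437567 / 2276587;     /* ds2 values now appear on ~2.8x as many rows */
   ds2_part = 2.4 * 3;               /* ds2's 2.4 GB scaled by a rounded-up factor of 3 */
   ds1_part = 2.1;                   /* rough allowance for the columns coming from ds1 */
   estimate = ds2_part + ds1_part;   /* about 9.3 GB, close to the observed 10 GB */
   put fanout= 4.1 estimate= 4.1;
run;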

 

agoldma
Pyrite | Level 9

When I imported that CSV file, I got 2 GB -- like I expected ... and it had bufsize=64k with CHAR compression

I tried making a dataset with bufsize=64k and BINARY compression... but it turned out to be 10 GB

 ...even though BINARY compression reduced space by 79%

That CSV file must be missing something... or what it's missing is useless
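That attempt was basically a straight re-copy, something like this (output name made up):

data project.ds3_binary (bufsize=64k compress=binary);
   set project.ds3;
run;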

FreelanceReinh
Jade | Level 19

For each dataset we can roughly calculate the average amount of memory (bytes) occupied by each value:

(page size * # pages)/(# obs. * # variables). The results are about 8.6 for ds1, 1.2 for ds2 and 3.8 for ds3.
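Plugging in the numbers from the first post:

data _null_;
   /* (page size * number of pages) / (observations * variables) */
   ds1 = (65536  * 18566 ) / (6437567 * 22 );   /* ~8.6 bytes per value */
   ds2 = (131072 * 19726 ) / (2276587 * 917);   /* ~1.2 bytes per value */
   ds3 = (131072 * 170379) / (6437567 * 924);   /* ~3.8 bytes per value */
   put ds1= 4.1 ds2= 4.1 ds3= 4.1;
run;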

 

The fact that the average value in ds2 (probably including numeric values of 8 bytes and character values of varying lengths, possibly up to 40) occupies only 1.2 bytes suggests that the average observation of ds2 is highly compressed. It is quite possible, though, that the compression rate varies considerably between records: Observations with many missing character values will have much higher compression rates than observations with long non-missing values.

 

Observations of the latter type (in ds2) might have field1-field2 combinations whose BY groups in ds1 consist of several observations so that multiple copies of those long values are created. Thus, the distribution of compression rates can easily shift towards lower rates. (After all, it's plausible that you tend to match more non-missing values than missing values from ds2 to ds1.) But this means an increased average memory consumption in ds3 per value coming from ds2 (i.e. >1.2 bytes). The much larger space requirements per value from ds1 (8.6 bytes) increase the average in ds3 further -- why not to 3.8 bytes?

Reeza
Super User
You have variables overwriting each other as well. That doesn't concern you?

917 + 22 - 2 (the linkage variables) = 937, but you only have 924.
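To see exactly which variables collide, something like this against DICTIONARY.COLUMNS would list the names the two inputs share (library and member names assumed from the thread):

proc sql;
   select a.name
   from dictionary.columns as a
        inner join dictionary.columns as b
        on upcase(a.name) = upcase(b.name)
   where a.libname = 'PROJECT' and a.memname = 'DS1'
     and b.libname = 'PROJECT' and b.memname = 'DS2';
quit;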
agoldma
Pyrite | Level 9

Hi Reeza,

Thank you for looking at this

I have a few variables overwriting each other, but I'm not concerned about it because they have the same or nearly the same values.

When I imported the CSV file, I got 2 GB like I wanted... but... most of those 900 variables changed from numeric to character, so instead of 8 bytes they were using 1 byte... which explains the reduction from 10 GB to 2 GB.

I guess Tom and FreelanceReinh are right -- I shouldn't expect the same compression in the resulting dataset that I had in the source datasets.

Reeza
Super User
Did you import your data with PROC IMPORT? If the length is 1, I suspect you have a lot of missing or truncated values.
agoldma
Pyrite | Level 9

Yes, the import was bad or at least very different

One big difference was that most of the 900 variables changed from numeric to character of length 1

Length 1 is not sufficient for most of those variables

They have many missing values, so PROC IMPORT (which guesses each column's type and length from only the rows it scans) probably didn't get them right
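If I wanted the CSV route to work, PROC IMPORT would probably need to scan more rows before guessing, something like this (file name made up; GUESSINGROWS=MAX can be slow on 6 million rows):

proc import datafile='/path/to/ds3.csv'
            out=work.ds3_from_csv
            dbms=csv
            replace;
   guessingrows=max;
run;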

Expecting 2 GB for the merged dataset is probably not realistic
