agoldma
Pyrite | Level 9

I'm merging 2 datasets

  1. count = 6,437,567 variables =  22   file size = 1.2 GB   page size =  65,536   # pages = 18,566
  2. count = 2,276,587 variables = 917  file size = 2.4 GB   page size = 131,072  # pages = 19,726

The resulting dataset has:

     count = 6,437,567   variables = 924  file size = 21.8 GB  page size = 131,072  # pages = 170,379

 

I didn't expect to get 22 GB when merging 1 GB with 2 GB

 

This is a straight merge without any calculations (dropping/merging a few fields)

I get the same results with a data step merge as with SQL (left join)

None of the character fields are excessively long (the longest is 40 and only a few are that long)

All 3 datasets use CHAR compression;  no indexes yet

Dataset option REUSE = YES reduced the output from 23 GB to 22 GB

I'm using SAS-EG 9.0401M5 on Linux

Does it really make sense for the output dataset to be 22 GB?
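(The figures above are the kind PROC CONTENTS reports; a minimal sketch, using the dataset names that appear later in this thread:)

proc contents data=project.ds1;   /* observations, variables, page size, number of pages, compression */
run;
proc contents data=project.ds2;
run;
proc contents data=project.ds3;
run;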

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19


View solution in original post

12 REPLIES
PeterClemmensen
Tourmaline | Level 20

Can you post your code please?

agoldma
Pyrite | Level 9

This is the data step version:

 

 

data project.ds3 (reuse=yes);
   merge project.ds1 (in=inX)
         project.ds2
         ;
   by field1
      field2
      ;
   if inX;   * keep only observations present in ds1 -- optional ;
run;

 

Compression is applied automatically based on the setup of my SAS-EG
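I believe this corresponds to having the system options set, roughly:

options compress=char reuse=yes;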

PeterClemmensen
Tourmaline | Level 20

Are you sure that the data set is compressed? What happens if you add the COMPRESS=CHAR option like this?

 

data project.ds3 (reuse=yes compress=char);
   merge project.ds1 (in=inX)
         project.ds2
         ;
   by field1
      field2
      ;
   if inX;   * optional ;
run;
agoldma
Pyrite | Level 9

I know that the dataset ds3 is compressed because PROC CONTENTS says so and the log says that compression reduced the size by 54%

I still tried adding COMPRESS=CHAR (at the top) but it didn't change anything

I tried a few other things:

  1. Data step copy -- produces the same size
  2. PROC COPY -- produces the same size
  3. PROC CPORT -- produces a smaller but still large transport file of 14 GB
     ... PROC CIMPORT -- produces the same size (22 GB)   (see the sketch at the end of this post)
  4. COMPRESS=BINARY -- reduced the size to 10 GB (still very large, coming from 1 + 2 GB)
  5. PROC EXPORT to CSV = 9 GB (record length between 1045 and 19602)

Maybe 10 GB is the right answer (just hard to believe that 1 + 2 = 10)

Maybe this is why we made relational databases 50 years ago (and I should keep these 2 datasets separate)
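The transport round trip in item 3 looked roughly like this (the file path here is just a placeholder):

proc cport data=project.ds3 file='/tmp/ds3.cpt';
run;

proc cimport infile='/tmp/ds3.cpt' library=work;
run;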

Tom
Super User

10 GB sounds right. You are adding over 900 variables to your 6 million observations.

 

2.4 * 3 = 7.2;  7.2 + 2.1 = 9.3
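Rough reasoning behind those numbers (the factor 3 and the 2.1 GB are ballpark assumptions, not measurements): ds2's 2.4 GB gets spread across roughly three times as many output rows, plus a couple of GB for the ds1 columns.

data _null_;
   fanout   = 6437567 / 2276587;     /* ds2 values now appear on ~2.8x as many rows */
   ds2_part = 2.4 * 3;               /* ds2's 2.4 GB scaled by a rounded-up factor of 3 */
   ds1_part = 2.1;                   /* rough allowance for the columns coming from ds1 */
   estimate = ds2_part + ds1_part;   /* about 9.3 GB, close to the observed 10 GB */
   put fanout= 4.1 estimate= 4.1;
run;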

 

agoldma
Pyrite | Level 9

When I imported that CSV file, I got 2 GB -- like I expected ... and it had bufsize=64k with CHAR compression

I tried making a dataset with bufsize=64k and BINARY compression... but it turned out to be 10 GB

 ...even though BINARY compression reduced space by 79%

That CSV file must be missing something... or what it's missing is useless
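That attempt was basically a straight re-copy, something like this (output name made up):

data project.ds3_binary (bufsize=64k compress=binary);
   set project.ds3;
run;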

FreelanceReinh
Jade | Level 19

For each dataset we can roughly calculate the average amount of memory (bytes) occupied by each value:

(page size * # pages)/(# obs. * # variables). The results are about 8.6 for ds1, 1.2 for ds2 and 3.8 for ds3.
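Plugging in the numbers from the first post:

data _null_;
   /* (page size * number of pages) / (observations * variables) */
   ds1 = (65536  * 18566 ) / (6437567 * 22 );   /* ~8.6 bytes per value */
   ds2 = (131072 * 19726 ) / (2276587 * 917);   /* ~1.2 bytes per value */
   ds3 = (131072 * 170379) / (6437567 * 924);   /* ~3.8 bytes per value */
   put ds1= 4.1 ds2= 4.1 ds3= 4.1;
run;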

 

The fact that the average value in ds2 (probably including numeric values of 8 bytes and character values of varying lengths, possibly up to 40) occupies only 1.2 bytes suggests that the average observation of ds2 is highly compressed. It is quite possible, though, that the compression rate varies considerably between records: Observations with many missing character values will have much higher compression rates than observations with long non-missing values.

 

Observations of the latter type (in ds2) might have field1-field2 combinations whose BY groups in ds1 consist of several observations so that multiple copies of those long values are created. Thus, the distribution of compression rates can easily shift towards lower rates. (After all, it's plausible that you tend to match more non-missing values than missing values from ds2 to ds1.) But this means an increased average memory consumption in ds3 per value coming from ds2 (i.e. >1.2 bytes). The much larger space requirements per value from ds1 (8.6 bytes) increase the average in ds3 further -- why not to 3.8 bytes?

Reeza
Super User
You have variables overwriting each other as well. That doesn't concern you?

917 + 22 - 2 (the linkage variables) = 937, but you only have 924.
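To see exactly which variables collide, something like this against DICTIONARY.COLUMNS would list the names the two inputs share (library and member names assumed from the thread):

proc sql;
   select a.name
   from dictionary.columns as a
        inner join dictionary.columns as b
        on upcase(a.name) = upcase(b.name)
   where a.libname = 'PROJECT' and a.memname = 'DS1'
     and b.libname = 'PROJECT' and b.memname = 'DS2';
quit;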
agoldma
Pyrite | Level 9

Hi Reeza,

Thank you for looking at this

I have a few variables overwriting each other, but I'm not concerned about it because they have the same or nearly the same values.

When I imported the CSV file, I got 2 GB like I wanted... but... most of those 900 variables changed from numeric to character, so instead of 8 bytes they were using 1 byte... which explains the reduction from 10 GB to 2 GB.

I guess Tom and FreelanceReinh are right -- I shouldn't expect the same compression in the resulting dataset that I had in the source datasets.

Reeza
Super User
Did you import your data with PROC IMPORT? If the length is 1, I suspect you have a lot of missing or truncated values.
agoldma
Pyrite | Level 9

Yes, the import was bad or at least very different

One big difference was that most of the 900 variables changed from numeric to character of length 1

Length 1 is not sufficient for most of those variables

They have many missing values, so PROC IMPORT (which guesses each column's type and length from only the rows it scans) probably didn't get them right
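If I wanted the CSV route to work, PROC IMPORT would probably need to scan more rows before guessing, something like this (file name made up; GUESSINGROWS=MAX can be slow on 6 million rows):

proc import datafile='/path/to/ds3.csv'
            out=work.ds3_from_csv
            dbms=csv
            replace;
   guessingrows=max;
run;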

Expecting 2 GB for the merged dataset is probably not realistic
