I'm merging 2 datasets
The resulting dataset has:
count = 6,437,567
variables = 924
file size = 21.8 GB
page size = 131,072
# pages = 170,379
I didn't expect to get 22 GB when merging a 1 GB dataset with a 2 GB dataset.
This is a straight merge without any calculations (I only drop a few fields, and a few overlapping fields overwrite each other).
I get the same results with a data step merge as with SQL (left join)
None of the character fields are excessively long (the longest is 40 and only a few are that long)
All 3 datasets use CHAR compression; no indexes yet
Dataset option REUSE = YES reduced the output from 23 GB to 22 GB
I'm using SAS 9.4M5 (9.0401M5) through Enterprise Guide, on Linux.
Does it really make sense for the output dataset to be 22 GB?
For each dataset we can roughly calculate the average amount of storage (bytes) occupied by each value:
(page size * # pages) / (# obs. * # variables). The results are about 8.6 bytes for ds1, 1.2 for ds2 and 3.8 for ds3.
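(If you want to check these figures, they can be read straight from the dictionary tables -- a sketch, assuming all three members sit in the PROJECT library as DS1, DS2 and DS3:)

proc sql;
  select memname, bufsize, npage, nobs, nvar,
         (bufsize * npage) / (nobs * nvar) as bytes_per_value format=8.2
  from dictionary.tables
  where libname = 'PROJECT' and memname in ('DS1', 'DS2', 'DS3');
quit;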
The fact that the average value in ds2 (probably a mix of 8-byte numeric values and character values of varying lengths, possibly up to 40) occupies only 1.2 bytes suggests that the average observation of ds2 is highly compressed. The compression rate may still vary considerably between records, though: observations with many missing character values will compress far better than observations with long non-missing values.
Observations of the latter type (in ds2) might have field1-field2 combinations whose BY groups in ds1 consist of several observations, so that multiple copies of those long values are created. Thus the distribution of compression rates in the result can easily shift towards lower rates. (After all, it's plausible that non-missing values from ds2 get matched more often than missing ones.) This means an increased average storage consumption per value in ds3 for the values coming from ds2 (i.e., more than 1.2 bytes). The much larger space requirement per value from ds1 (8.6 bytes) pushes the ds3 average up further -- so ending up around 3.8 bytes is not implausible.
Can you post your code please?
This is the data step version:
data project.ds3 (reuse=yes);
  merge project.ds1 (in=inX)
        project.ds2;
  by field1 field2;
  if inX;  * keep only observations that exist in ds1 (optional);
run;
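For completeness, the SQL left-join version mentioned in the question would look roughly like this (just a sketch -- the actual select list isn't shown in the thread, and the overlapping key columns would only trigger "already exists" warnings):

proc sql;
  create table project.ds3_sql as
  select a.*, b.*                      /* real code would list or drop columns */
  from project.ds1 as a
       left join project.ds2 as b
         on a.field1 = b.field1 and a.field2 = b.field2;
quit;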
Compression is applied automatically, based on the setup of my SAS-EG session.
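(Presumably that setup just sets the system option in the configuration or autoexec -- a minimal sketch of what it amounts to:)

options compress=char;  /* session default: new data sets are created with CHAR compression */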
Are you sure that the data set is actually compressed? What happens if you add the COMPRESS=CHAR dataset option, like this?
data project.ds3 (reuse=yes compress=char);
  merge project.ds1 (in=inX)
        project.ds2;
  by field1 field2;
  if inX;  * optional;
run;
Which compression rate is reported in the log?
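For reference, a quick way to check both points (a sketch, using the names from this thread):

proc contents data=project.ds3;   /* the "Compressed" attribute should show CHAR */
run;

proc sql;                         /* or read the attributes from the dictionary */
  select memname, compress, pcompress
  from dictionary.tables
  where libname = 'PROJECT' and memname = 'DS3';
quit;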
I know that the dataset ds3 is compressed because PROC CONTENTS says so and the log says that compression reduced the size by 54%
I still tried adding COMPRESS=CHAR (at the top) but it didn't change anything
I tried a few other things:
Maybe 10 GB is the right answer (just hard to believe that 1 + 2 = 10)
Maybe this is why we made relational databases 50 years ago (and I should keep these 2 datasets separate)
10 sounds right. You are adding over 900 variables to your 6 million observations.
2.4 * 3 = 7.2, and 7.2 + 2.1 = 9.3
When I imported that CSV file, I got 2 GB -- like I expected ... and it had bufsize=64k with CHAR compression
I tried making a dataset with bufsize=64k and BINARY compression... but it turned out to be 10GB
...even though BINARY compression reduced space by 79%
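Those numbers are at least consistent with each other: backing the uncompressed size out of each reported compression rate gives roughly the same figure as a naive width-based estimate (a sketch -- the 8 bytes per variable is an assumption, since most of the 900+ variables appear to be numeric):

data _null_;
  raw_from_char   = 21.8 / (1 - 0.54);            /* 54% reduction, 21.8 GB observed */
  raw_from_binary = 10   / (1 - 0.79);            /* 79% reduction, 10 GB observed   */
  raw_from_width  = 6437567 * 924 * 8 / 1024**3;  /* obs * variables * ~8 bytes each */
  put raw_from_char= 6.1 raw_from_binary= 6.1 raw_from_width= 6.1;
run;

All three land in the same 44-48 GB ballpark for the uncompressed width, which makes a compressed result anywhere from 10 to 22 GB far less surprising than "1 + 2 = 10" suggests.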
That CSV file must be missing something... or whatever it's missing is useless.
Hi Reeza,
Thank you for looking at this
I have a few variables overwriting each other, but I'm not concerned about that because they have the same or nearly the same values.
When I imported the CSV file, I got 2 GB like I wanted... but most of those 900 variables came in as character instead of numeric, so instead of 8 bytes each they were using only 1 byte each... which explains the reduction from 10 GB to 2 GB.
I guess Tom and FreelanceReinhard are right -- that I shouldn't expect the same compression in the resulting dataset that I had in the source datasets.
Yes, the import was bad or at least very different
One big difference was that most of the 900 variables changed from numeric to character of length 1
Length 1 is not sufficient for most of those variables
They have many missing values, so PROC IMPORT probably guessed the wrong types and lengths from the rows it scanned (see the sketch below).
Expecting 2GB for the merged dataset is probably not realistic
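If the CSV route gets revisited, the usual fix is to let PROC IMPORT scan more (or all) rows before it guesses column types -- a sketch (the file name and path are hypothetical):

proc import datafile='/path/to/merged.csv'   /* hypothetical path */
            out=work.from_csv
            dbms=csv
            replace;
  guessingrows=max;   /* scan every row before deciding each column's type and length */
run;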