Solved: Re: Data file size increased by about 3 times after merging

Usagi · Posted 02-08-2016 05:55 PM

File1 is about 3.5 GB with about 7.9 million records. File2 is about 1.1 GB with the same number of records. Variable ID is an unique identifier giving a 1:1 merge. Fileout (generated from the code below) has the same number of records as File1 and File2, but it is 14.6 GB. I checked with PROC CONTENTS that the corresponding variable Type, Len, Format and Informat are all the same in Fileout, so I am puzzled as to why the file size increased by about 3 times that of (File1 size) + (File2 size). How do I check why this is so?

DATA fileout;
MERGE file1 file2;
BY ID;
RUN;

PGStats · Posted 02-08-2016 06:20 PM

Was file1 or file2 compressed?

PG

View solution in original post

ballardw · Posted 02-08-2016 06:14 PM

How many variables are involved? Variables not both sets are in the resultant set.

If your ID variable is duplicated in one set then you get a many-to-one merge resulting in the data from one set repeated for each duplicate of the ID value. Meaning that values in one set may well be repeated useing more storage space than you think.

Take a look at the results from this program, specifically the values of variable k.

data one;
   Do id= 1 to 3;
      do j= 1 to 5;
      output;
      end;
   end;
run;

data two;
   do id = 2 to 5;
      do k = 2 to 3;
      output;
      end;
   end;
run;

data merged;
   merge one two;
   by id;
run;

proc print data=merged;
   var id j k;
run;

Usagi · Posted 02-08-2016 06:34 PM

No, there is no doubling up because ID is an unique identifier. Number of records or rows is the same in all 3 files. The number of variables in file1 is 184, in file2 is 66 and in fileout 249 (=184+66-1).

PGStats · Posted 02-08-2016 06:20 PM

Was file1 or file2 compressed?

PG

Usagi · Posted 02-08-2016 06:32 PM

That is my suspicion. Looking in more detail at the first table from PROC CONTENTS, Fileout had "NO" for "COMPRESSED", whereas File1 and File2 had "CHAR" for "COMPRESSED".

How do I compress a file? Would compression slow data processing for later on?

PGStats · Posted 02-08-2016 09:09 PM

Yes it does take a bit more processing.

DATA fileout(COMPRESS=CHAR);
    MERGE file1 file2;
    BY ID;
RUN;

"Advantages of compressing a file include reduced storage requirements for the file and fewer I/O operations to read or write to the data during processing. However, more CPU resources are required to read a compressed file (because of the overhead of uncompressing each observation)." - SAS doc.

PG

Reeza · Posted 02-08-2016 06:21 PM

Do any of your original data sets have compression set to yes?

Usagi · Posted 02-08-2016 06:33 PM

Looking in more detail at the first table from PROC CONTENTS, Fileout had "NO" for "COMPRESSED", whereas File1 and File2 had "CHAR" for "COMPRESSED".

How do I compress a file? Would compression slow data processing for later on?

Ready to join fellow brilliant minds for the SAS Hackathon?