SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Data file size increased by about 3 times after merging

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Data file size increased by about 3 times after merging

[ Edited ]

File1 is about 3.5 GB with about 7.9 million records. File2 is about 1.1 GB with the same number of records. Variable ID is an unique identifier giving a 1:1 merge. Fileout (generated from the code below) has the same number of records as File1 and File2, but it is 14.6 GB. I checked with PROC CONTENTS that the corresponding variable Type, Len, Format and Informat are all the same in Fileout, so I am puzzled as to why the file size increased by about 3 times that of (File1 size) + (File2 size). How do I check why this is so?

 

DATA fileout;
    MERGE file1 file2;
    BY ID;
RUN;

 

 


Accepted Solutions
Solution
‎02-09-2016 01:27 AM
Respected Advisor
Posts: 4,919

Re: Data file size increased by about 3 times after merging

Was file1 or file2 compressed?

PG

View solution in original post


All Replies
Super User
Posts: 11,343

Re: Data file size increased by about 3 times after merging

How many variables are involved? Variables not both sets are in the resultant set.

If your ID variable is duplicated in one set then you get a many-to-one merge resulting in the data from one set repeated for each duplicate of the ID value. Meaning that values in one set may well be repeated useing more storage space than you think.

 

Take a look at the results from this program, specifically the values of variable k.

data one;
   Do id= 1 to 3;
      do j= 1 to 5;
      output;
      end;
   end;
run;

data two;
   do id = 2 to 5;
      do k = 2 to 3;
      output;
      end;
   end;
run;

data merged;
   merge one two;
   by id;
run;

proc print data=merged;
   var id j k;
run;

 

 

Occasional Contributor
Posts: 6

Re: Data file size increased by about 3 times after merging

[ Edited ]

No, there is no doubling up because ID is an unique identifier. Number of records or rows is the same in all 3 files. The number of variables in file1 is 184, in file2 is 66 and in fileout 249 (=184+66-1).

Solution
‎02-09-2016 01:27 AM
Respected Advisor
Posts: 4,919

Re: Data file size increased by about 3 times after merging

Was file1 or file2 compressed?

PG
Occasional Contributor
Posts: 6

Re: Data file size increased by about 3 times after merging

That is my suspicion. Looking in more detail at the first table from PROC CONTENTS, Fileout had "NO" for "COMPRESSED", whereas File1 and File2 had "CHAR" for "COMPRESSED".

 

How do I compress a file? Would compression slow data processing for later on?

Respected Advisor
Posts: 4,919

Re: Data file size increased by about 3 times after merging

[ Edited ]

Yes it does take a bit more processing.

 

DATA fileout(COMPRESS=CHAR);
    MERGE file1 file2;
    BY ID;
RUN;

"Advantages of compressing a file include reduced storage requirements for the file and fewer I/O operations to read or write to the data during processing. However, more CPU resources are required to read a compressed file (because of the overhead of uncompressing each observation)."  - SAS doc.

PG
Super User
Posts: 19,770

Re: Data file size increased by about 3 times after merging

Do any of your original data sets have compression set to yes?

Occasional Contributor
Posts: 6

Re: Data file size increased by about 3 times after merging

Looking in more detail at the first table from PROC CONTENTS, Fileout had "NO" for "COMPRESSED", whereas File1 and File2 had "CHAR" for "COMPRESSED".

 

How do I compress a file? Would compression slow data processing for later on?

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 491 views
  • 3 likes
  • 4 in conversation