BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Usagi
Fluorite | Level 6

File1 is about 3.5 GB with about 7.9 million records. File2 is about 1.1 GB with the same number of records. Variable ID is an unique identifier giving a 1:1 merge. Fileout (generated from the code below) has the same number of records as File1 and File2, but it is 14.6 GB. I checked with PROC CONTENTS that the corresponding variable Type, Len, Format and Informat are all the same in Fileout, so I am puzzled as to why the file size increased by about 3 times that of (File1 size) + (File2 size). How do I check why this is so?

 

DATA fileout;
    MERGE file1 file2;
    BY ID;
RUN;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
7 REPLIES 7
ballardw
Super User

How many variables are involved? Variables not both sets are in the resultant set.

If your ID variable is duplicated in one set then you get a many-to-one merge resulting in the data from one set repeated for each duplicate of the ID value. Meaning that values in one set may well be repeated useing more storage space than you think.

 

Take a look at the results from this program, specifically the values of variable k.

data one;
   Do id= 1 to 3;
      do j= 1 to 5;
      output;
      end;
   end;
run;

data two;
   do id = 2 to 5;
      do k = 2 to 3;
      output;
      end;
   end;
run;

data merged;
   merge one two;
   by id;
run;

proc print data=merged;
   var id j k;
run;

 

 

Usagi
Fluorite | Level 6

No, there is no doubling up because ID is an unique identifier. Number of records or rows is the same in all 3 files. The number of variables in file1 is 184, in file2 is 66 and in fileout 249 (=184+66-1).

PGStats
Opal | Level 21

Was file1 or file2 compressed?

PG
Usagi
Fluorite | Level 6

That is my suspicion. Looking in more detail at the first table from PROC CONTENTS, Fileout had "NO" for "COMPRESSED", whereas File1 and File2 had "CHAR" for "COMPRESSED".

 

How do I compress a file? Would compression slow data processing for later on?

PGStats
Opal | Level 21

Yes it does take a bit more processing.

 

DATA fileout(COMPRESS=CHAR);
    MERGE file1 file2;
    BY ID;
RUN;

"Advantages of compressing a file include reduced storage requirements for the file and fewer I/O operations to read or write to the data during processing. However, more CPU resources are required to read a compressed file (because of the overhead of uncompressing each observation)."  - SAS doc.

PG
Reeza
Super User

Do any of your original data sets have compression set to yes?

Usagi
Fluorite | Level 6

Looking in more detail at the first table from PROC CONTENTS, Fileout had "NO" for "COMPRESSED", whereas File1 and File2 had "CHAR" for "COMPRESSED".

 

How do I compress a file? Would compression slow data processing for later on?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 2901 views
  • 3 likes
  • 4 in conversation