hanfei28
Fluorite | Level 6

Hi All,

 

I used

 

data lib.A;

   set B C;

run;

 

to concatenate two data sets, B and C. Data set B is 80.9 GB and data set C is 78.4 GB, but the resulting data set A is 378 GB. Could anyone explain why data set A is so much larger than the sum of data sets B and C?

 

Thanks.

 

12 REPLIES
hanfei28
Fluorite | Level 6

Data sets B and C are not compressed.

Reeza
Super User

Do you get any messages about mismatching lengths or such in your log?

 

If so, that's likely why. You probably have one or more character fields that are large in one of the data sets. You have mentioned the size, but how many records are in each data set?

 

If you post the proc contents output for each data set (A, B and C), it should be relatively easy to see where the issue is.
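
A minimal sketch of that check, assuming A sits in the library named lib from the original post and that B and C are in WORK (the original step does not say which library the unqualified names resolve to):

```sas
/* Assumption: library names below are guesses based on the original data step. */
proc contents data=lib.A;  run;
proc contents data=work.B; run;
proc contents data=work.C; run;
```

In the output, the lines worth comparing side by side are Observation Length, Compressed, Data Set Page Size and Number of Data Set Pages, followed by the Len column in the variable list.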


hanfei28
Fluorite | Level 6

I compared the proc contents output for the three data sets. They have the same variable names and variable lengths.

 

Data B has 198,840,456 rows 

Data C has 192,560,707 rows.

Data A has 391,401,163 rows.
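
As a rough sanity check on these figures, the row counts add up (198,840,456 + 192,560,707 = 391,401,163), so no rows were duplicated. The sketch below divides the reported file sizes by the row counts; it assumes the sizes are decimal gigabytes (1e9 bytes) and ignores page overhead, so treat the results as approximate:

```sas
data _null_;
  /* Assumption: reported sizes are decimal GB; page overhead is ignored. */
  bytes_per_row_b = 80.9e9 / 198840456;
  bytes_per_row_c = 78.4e9 / 192560707;
  bytes_per_row_a = 378e9  / 391401163;
  put bytes_per_row_b= 8.1 bytes_per_row_c= 8.1 bytes_per_row_a= 8.1;
run;
```

This works out to roughly 407 bytes per row for B and for C, against roughly 966 bytes per row for A, so each row of A appears to occupy more than twice the space on disk.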

 

r_behata
Barite | Level 11

Were Data A and Data B created in the same environment, or were they copied from a different machine or server?

hanfei28
Fluorite | Level 6

The two contributing data sets, B and C, are from the same system (a Linux system). The SAS code is run on the same system, and data set A is saved to the same directory.

r_behata
Barite | Level 11

Check the default compression setting:

 

proc options option=compress;
run;

If it is not the same, apply binary compression and try again.

 

options compress=binary;
hanfei28
Fluorite | Level 6

Here is the output: @r_behata 

 

rsubmit;


NOTE: Remote submit to POE commencing.
proc options option=compress;
run;

 

SAS (r) Proprietary Software Release 9.4 TS1M4

COMPRESS=NO Specifies the type of compression to use for observations in output SAS data
sets.

Reeza
Super User
Unfortunately, you can't check just the system option. Compression can be set on an individual data set, globally via the COMPRESS= system option, or on a library.
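
One way to see what each data set actually uses, regardless of where the setting came from, is to query dictionary.tables. This is a sketch only; the library names LIB and WORK are assumptions based on the original post:

```sas
proc sql;
  /* compress shows the compression routine recorded for each data set;
     obslen, npage and filesize help explain the size on disk.
     Assumption: A is in LIB, B and C are in WORK. */
  select libname, memname, nobs, obslen, compress, npage, filesize
    from dictionary.tables
    where (libname = 'LIB'  and memname = 'A')
       or (libname = 'WORK' and memname in ('B', 'C'));
quit;
```

If B and C turn out to be compressed after all, the COMPRESS= data set option can be applied when creating A, for example data lib.A(compress=binary);.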
Reeza
Super User

@hanfei28 wrote:

I compared the proc contents output for the three data sets. They have the same variable names and variable lengths.

 

Data B has 198,840,456 rows 

Data C has 192,560,707 rows.

Data A has 391,401,163 rows.

 


If you believe they're identical, grab the attributes from sashelp.vcolumn and run proc compare to see the differences.

 

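/* Column attributes for A. Change libname/memname to match your data; */
/* both are stored in uppercase in sashelp.vcolumn. */
data a_details;
set sashelp.vcolumn;
where libname='WORK' and memname='A';
run;

/* Column attributes for B. */
data b_details;
set sashelp.vcolumn;
where libname='WORK' and memname='B';
run;

/* Report any attribute (type, length, format, ...) that differs. */
proc compare data=a_details compare=b_details;
run;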
ballardw
Super User

If there are any variables not common to both sets, you end up with missing values that still take up space. Each numeric variable would use 8 bytes for each row of the data set that does not contain the variable.

 

So when you say you have

Data B has 198,840,456 rows 

Data C has 192,560,707 rows.

 

If you have one variable in B that is not in C, then you potentially have 192560707*8 bytes of storage reserved. Multiply that by the number of variables in B not in C. If you have a variable in C that is not in B, that is potentially 198840456*8 bytes of disk use.

 

Long character values can make this potentially many more bytes.
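
To put numbers on that, here is a small sketch that just evaluates the two products above. It assumes the default numeric length of 8 bytes and no compression:

```sas
data _null_;
  rows_b = 198840456;
  rows_c = 192560707;
  /* Space reserved for one 8-byte numeric that exists in only one input. */
  extra_if_only_in_b = rows_c * 8;  /* padding added to the rows coming from C */
  extra_if_only_in_c = rows_b * 8;  /* padding added to the rows coming from B */
  put extra_if_only_in_b= comma20. extra_if_only_in_c= comma20.;
run;
```

That is 1,540,485,656 and 1,590,723,648 bytes respectively, so roughly 1.5 to 1.6 GB per mismatched numeric variable. A single character variable with a length in the hundreds of bytes would cost correspondingly more.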

