Hi All,
I used
data lib.A;
set B C;
run;
to concatenate two data sets, B and C. Data set B is 80.9 GB and data set C is 78.4 GB, but the resulting data set A is 378 GB. Could anyone explain why data set A is so much larger than the sum of data sets B and C?
Thanks.
Is either of the two data sets compressed?
Data sets B and C are not compressed.
Do you get any messages about mismatching lengths or such in your log?
If so, that's likely why. You probably have one or more character fields that are large in one of the data sets. You have mentioned the size, but how many records are in each data set?
If you post PROC CONTENTS output for each data set (A, B, and C), it should be relatively easy to see where the issue is.
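A minimal sketch of that request, assuming B and C live in WORK (the original step used one-level names) and A lives in the libref LIB from the first post. The lines worth comparing across the three outputs are Observations, Observation Length, Compressed, and Data Set Page Size, plus the Len column of the variable list.

```sas
/* Assumption: B and C are in WORK, A is in the libref LIB used in the original step */
proc contents data=lib.A varnum;   /* VARNUM lists variables in creation order */
run;

proc contents data=work.B varnum;
run;

proc contents data=work.C varnum;
run;
```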
I compared the PROC CONTENTS output for the three data sets. They have the same variable names and variable lengths.
Data B has 198,840,456 rows
Data C has 192,560,707 rows.
Data A has 391,401,163 rows.
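The row counts add up (198,840,456 + 192,560,707 = 391,401,163), so no rows were duplicated. A rough way to see where the extra space goes is to divide each file size by its row count. The sketch below uses the sizes quoted in the first post and treats 1 GB as 1e9 bytes, which is an assumption about how the sizes were reported. If the per-row figure for A comes out well above that of B and C, the extra space is in each observation (or in page overhead) rather than in extra rows.

```sas
data _null_;
  /* File sizes as quoted in the thread; 1 GB = 1e9 bytes is an assumption */
  a_bytes = 378e9;
  b_bytes = 80.9e9;
  c_bytes = 78.4e9;

  /* Row counts as quoted in the thread */
  a_rows = 391401163;
  b_rows = 198840456;
  c_rows = 192560707;

  /* Approximate bytes stored per row, including page overhead */
  a_per_row = a_bytes / a_rows;
  b_per_row = b_bytes / b_rows;
  c_per_row = c_bytes / c_rows;

  put a_per_row= comma12.1 b_per_row= comma12.1 c_per_row= comma12.1;
run;
```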
Were data sets A and B created in the same environment, or were they copied from a different machine or server?
The two contributing data sets, B and C, are from the same system (Linux). The SAS code is run on that same system, and data set A is saved to the same directory.
Check the default compression setting:
proc options option=compress; run;
If it is not the same, apply binary compression and try again:
options compress=binary;
Here is the output: @r_behata
rsubmit;
NOTE: Remote submit to POE commencing.
proc options option=compress;
run;
SAS (r) Proprietary Software Release 9.4 TS1M4
COMPRESS=NO Specifies the type of compression to use for observations in output SAS data
sets.
@hanfei28 wrote:
I compared the proc contents output for the three sets. They have the same variables names and variable length.
Data B has 198,840,456 rows
Data C has 192,560,707 rows.
Data A has 391,401,163 rows.
If you believe they're identical, grab the attributes from sashelp.vcolumn and run proc compare to see the differences.
data a_details;
set sashelp.vcolumn;
where libname='WORK' and memname='A';
run;
data b_details;
set sashelp.vcolumn;
where libname='WORK' and memname='B';
run;
proc compare data=a_details compare=b_details;
run;
Please post the complete(!) output of proc contents for all three datasets.
If there are any variables not common to both sets, you end up with missing values that will take up space. Each numeric variable would use 8 bytes for each row of the data set that does not contain the variable.
So when you say you have
Data B has 198,840,456 rows
Data C has 192,560,707 rows.
If you have one variable in B that is not in C, then you potentially have 192560707*8 bytes of storage reserved. Multiply that by the number of variables in B that are not in C. If you have a variable in C that is not in B, then that is 198840456*8 bytes of potential disk use.
Long character values can make this potentially many more bytes.
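To put numbers on that, here is a small sketch that works out the padding for one unmatched numeric variable in each direction, using the row counts quoted above. The variable names are only illustrative.

```sas
data _null_;
  rows_b = 198840456;   /* rows in B, as quoted above */
  rows_c = 192560707;   /* rows in C, as quoted above */

  /* One numeric variable in B but not in C: 8 bytes of missing value per row of C */
  pad_for_c_rows = rows_c * 8;

  /* One numeric variable in C but not in B: 8 bytes of missing value per row of B */
  pad_for_b_rows = rows_b * 8;

  put pad_for_c_rows= comma20. pad_for_b_rows= comma20.;
run;
```

That works out to 1,540,485,656 and 1,590,723,648 bytes, so roughly 1.5 GB for each unmatched numeric variable. A character variable pads by its full declared length per row instead of 8.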