resulting data set size becomes much bigger after concatenating two data sets

hanfei28 · Posted 11-26-2019 04:29 PM

Hi All,

I used

data lib.A;
set B C;
run;

to concatenate two data sets, B and C. Data set B is 80.9GB and data set C is 78.4GB, but the resulting data set A is 378GB. Could anyone explain why data set A is so much bigger than the sum of data sets B and C?

Thanks.

PeterClemmensen · Posted 11-26-2019 04:31 PM

Is either of the two data sets compressed?

hanfei28 · Posted 11-26-2019 04:33 PM

Data sets B and C are not compressed.

Reeza · Posted 11-26-2019 04:44 PM

Do you get any messages about mismatching lengths or such in your log?

If so, that's likely why. You probably have one or more character fields that are large in one of the data sets. You have mentioned the size, but how many records are in each data set?

If you post a proc contents on each data set (A, B and C), it should be relatively easy to see where the issue is.
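A minimal sketch of that check, assuming A lives in the library `lib` and that B and C are one-level names as in the original post (adjust the librefs if they are stored elsewhere):

```sas
/* Print the full attribute listing for each of the three data sets. */
proc contents data=lib.A; run;
proc contents data=B; run;
proc contents data=C; run;
```

In the output, the header fields worth comparing side by side are Observation Length, Compressed, Data Set Page Size and Number of Data Set Pages, along with the length of each variable in the variable list.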

hanfei28 · Posted 11-26-2019 04:52 PM

I compared the proc contents output for the three data sets. They have the same variable names and variable lengths.

Data B has 198,840,456 rows

Data C has 192,560,707 rows.

Data A has 391,401,163 rows.

r_behata · Posted 11-26-2019 04:55 PM

Were data sets A and B created in the same environment, or were they copied from a different machine or server?

hanfei28 · Posted 11-26-2019 04:58 PM

The two contributing data sets, B and C, are from the same system (a Linux system). The SAS code is run on the same system and data set A is saved to the same directory.

r_behata · Posted 11-26-2019 05:10 PM

Check the default compression setting:

proc options option=compress;
run;

If the setting is not the same, apply binary compression and try again:

options compress=binary;
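As a sketch of the same idea scoped to a single step rather than the whole session, the COMPRESS= data set option can be placed on the output data set. The names `lib.A`, `B` and `C` are taken from the original post; whether compression actually shrinks the file depends on the data.

```sas
/* Compress only the output data set; the session-wide COMPRESS option is left unchanged. */
data lib.A (compress=binary);
set B C;
run;
```

When a compressed data set is written, the log normally includes a note reporting how much the size changed, which gives a quick read on whether the option helped.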

hanfei28 · Posted 11-26-2019 05:19 PM

Here is the output: @r_behata

rsubmit;

NOTE: Remote submit to POE commencing.
proc options option=compress;
run;

SAS (r) Proprietary Software Release 9.4 TS1M4

COMPRESS=NO Specifies the type of compression to use for observations in output SAS data
sets.

Reeza · Posted 11-26-2019 05:20 PM

Unfortunately, checking the system option alone isn't enough. Compression can be set on individual data sets, globally via the system option, or on a library.
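One way to see what is actually in effect for each data set is to query DICTIONARY.TABLES. This is only a sketch: the libref `LIB` for A and `WORK` for B and C are assumptions based on the code in the original post, and it would need to run in the same (remote) session that holds the data.

```sas
/* Compare stored attributes: rows, bytes per row, compression and file size on disk. */
proc sql;
select libname, memname, nobs, obslen, compress, filesize
from dictionary.tables
where (libname = 'LIB' and memname = 'A')
or (libname = 'WORK' and memname in ('B', 'C'));
quit;
```

If `obslen` times `nobs` is far larger than `filesize` for B or C, that data set is stored more compactly than its uncompressed row layout would suggest.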

Reeza · Posted 11-26-2019 05:17 PM

@hanfei28 wrote:

I compared the proc contents output for the three sets. They have the same variables names and variable length.

Data B has 198,840,456 rows

Data C has 192,560,707 rows.

Data A has 391,401,163 rows.

If you believe they're identical, grab the attributes from sashelp.vcolumn and run proc compare to see the differences.

data a_details;
set sashelp.vcolumn;
where libname='WORK' and memname='A';
run;

data b_details;
set sashelp.vcolumn;
where libname='WORK' and memname='B';
run;

proc compare data=a_details compare=b_details;
run;

Kurt_Bremser · Posted 11-27-2019 04:18 AM

Please post the complete(!) output of proc contents for all three datasets.

ballardw · Posted 11-26-2019 05:20 PM

If there are any variables not common to both data sets, you end up with missing values that will take up space. Each numeric variable would use 8 bytes for each row of the data set that does not contain it.

So when you say you have

Data B has 198,840,456 rows

Data C has 192,560,707 rows.

if you have one variable in B that is not in C, then you potentially have 192560707*8 bytes of storage reserved. Multiply that by the number of variables in B that are not in C. If you have a variable in C that is not in B, then that is 198840456*8 bytes of potential disk use.

Long character variables can make this potentially many more bytes.
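To put numbers on that, here is the arithmetic as a small DATA step, using the row counts quoted above and assuming the default numeric length of 8 bytes. The variable names are made up for illustration.

```sas
data _null_;
rows_b = 198840456;      /* rows in B, from the thread */
rows_c = 192560707;      /* rows in C, from the thread */
bytes_per_numeric = 8;   /* assumed default numeric length */
/* padding added to C's rows for one numeric variable found only in B */
pad_for_c_rows = rows_c * bytes_per_numeric;   /* 1,540,485,656 bytes */
/* padding added to B's rows for one numeric variable found only in C */
pad_for_b_rows = rows_b * bytes_per_numeric;   /* 1,590,723,648 bytes */
put pad_for_c_rows= comma20. pad_for_b_rows= comma20.;
run;
```

That works out to roughly 1.5 GB per unshared numeric variable, so numeric padding alone would need a large number of such variables to account for a gap of about 219 GB (378 minus 80.9 minus 78.4).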
