topic Re: Size of long vs wide dataset in SAS Programming

Size of long vs wide dataset

somebody — Sat, 08 Aug 2020 05:45:19 GMT

I have a very large dataset (70 gb) that has 6 columns and 200 million rows. I tested a subset of this data with about 3 mil rows. this takes about 1gb. I transposed this long dataset to a wide dataset that has about 5000 rows and 2000 columns (these columns are dates, except 5 first columns are other vars). This reduces the size significantly to 65mb.

So my questions are: Why did the size change so much when I transpose from long to wide? how does SAS store data? what memory/size does it assign to row vs column? Are there tricks when dealing with such large datasets?

Thanks

Re: Size of long vs wide dataset

Kurt_Bremser — Sat, 08 Aug 2020 06:13:45 GMT

Which other variables (aside from the dates and the "by" variables) does your long dataset contain?

Is your original dataset uncompressed, while your transposed dataset is compressed?

Re: Size of long vs wide dataset

somebody — Sat, 08 Aug 2020 06:19:06 GMT

Please see attached photo for a sample of my LONG dataset.

I use the following code to transpose the data.:

proc transpose data=test out=test2;
	by fund_ticker secid name constituent_ticker detail_holding_type;
	id date; var holding;
run;

To my knowledge, they are not compressed. How do I compress a SAS dataset? and does compress means making like a zip or rar file?

Re: Size of long vs wide dataset

Shmuel — Sat, 08 Aug 2020 06:31:39 GMT

You compare two datasets:

1) LONG = 3*10^6 rows * (5 columns + 1 date);

each date holds 6 to 8 bytes depending on compression.

So the date part holds between 6*3*10^6 to 8*3*10^6 bytes

i.e. 18 to 24 MB only out of 1 GB. Most of size holds the first 5 variables.

2) WIDE = 5000 rows * (5 columns + 1995 dates);

The dates hold probably (compressed) 5000*1995*6 bytes =~ 57 MB only.

Thus the saved size is that part of 5 columns * (3*10^6 rows - 5000 rows).

Check the length of the 5 first columns. I suppose that some those 5 can be codded to a number and use format to display the full value. You may save a lot of space.

Re: Size of long vs wide dataset

Shmuel — Sat, 08 Aug 2020 06:29:57 GMT

for compression check proc option option=compress.
If need add option compress=yes.

Re: Size of long vs wide dataset

Kurt_Bremser — Sat, 08 Aug 2020 06:30:58 GMT

The simple COMPRESS=YES dataset or system option uses a run-length compression that reduces sequences of identical characters/strings.

You can see the state of a dataset in the output of PROC CONTENTS.

Re: Size of long vs wide dataset

SASKiwi — Sat, 08 Aug 2020 06:33:05 GMT

@somebody - If you want to work through a dataset size calculation yourself: https://v8doc.sas.com/sashtml/cms/zas-size.htm

Current SAS datasets use the V8 calculation.

Re: Size of long vs wide dataset

Kurt_Bremser — Sat, 08 Aug 2020 06:45:27 GMT

It sems that secid, name and probably constituent_ticker and detail_holding_type are redundant to fund_ticker. If that is the case, you can move them out to a lookup table and/or create fomats for them.

Re: Size of long vs wide dataset

Shmuel — Sat, 08 Aug 2020 07:27:11 GMT

I see there are many repetitives of FUND_FICKER and of DETAIL_HOLDING_TYPE.

At least those two can be replaced by codes, having next format

proc format lib=library;
  value  fticker
    1 = 'AADR'
   .... add existing/possible values ...
   ;
  value $iticker
   'ADDR' = '1'
   ...
   ;
  value htype
    1 = 'EQUITY'
    2 = 'CASH'
   .... add existing/possible values ...
   ; 
  value $itype
    'EQUITY' = '1'
    'CASH' = '2'
    ...
   ;	
run;

then replace those variables in your date, converting from char type to numeric:

data new;
 set have(rename = (fund_ticker=v1 Detail_Holding_type=v2));
     fund_ticker = input(put(v1,$iticker.),best3.);
	 Detail_Holding_type = input(put(v2,$itype.),best3.);
	 drop v1 v2;
run;

Beyond, if name is one to one as SecID you can drop NAME from your dataset

and display it by an appropriate format.

Re: Size of long vs wide dataset

RichardDeVen — Sat, 08 Aug 2020 13:42:12 GMT

A long (I find tall a better description) data set can be considered a large monolithic rectangle containing one or more categorical columns (identity'ish roles) having repeated values. A combination of columns having the same values in different rows form a group. The categorical columns model either the concept of a 'row' identifier, or a hierarchical organization suitable for BY group processing. The repeated values are what can make a tall table large. Suppose you had a categorical variable that is char(40) and an average group size of 100. That would be 4,000 characters per group. Pivoting (or transposing) the tall group into a wide row with 100 variables (presuming a model of one wide variable per tall row) there will be only ONE instance of the categorical value, reducing the repetition and resulting in nominally a 100:1 reduction in storage space required for the groups identity variables.

Roughly, the larger the storage needed for group/hierarchy identity the larger the reduction in space will be. However, switching data form from tall to wide may increase complexity and harden code in downstream process flows.

Time is money spent with no refunds. Disk space is cheap, coder time and maintenance is not.

Re: Size of long vs wide dataset

somebody — Sun, 09 Aug 2020 03:48:45 GMT

Thank you, i thought so too. but being a phd student, disk space is also not cheap 😄