somebody
Lapis Lazuli | Level 10

I have a very large dataset (70 GB) that has 6 columns and 200 million rows. I tested a subset of this data with about 3 million rows; this takes about 1 GB. I transposed this long dataset to a wide dataset that has about 5000 rows and 2000 columns (the columns are dates, except the first 5 columns, which are other variables). This reduces the size significantly, to 65 MB.

So my questions are: Why did the size change so much when I transposed from long to wide? How does SAS store data? What memory/size does it assign to a row vs. a column? Are there tricks for dealing with such large datasets?

Thanks

 

10 REPLIES
Kurt_Bremser
Super User

Which other variables (aside from the dates and the "by" variables) does your long dataset contain?

Is your original dataset uncompressed, while your transposed dataset is compressed?

somebody
Lapis Lazuli | Level 10

Please see attached photo for a sample of my LONG dataset. 

somebody_0-1596867415027.png

I use the following code to transpose the data:

proc transpose data=test out=test2;
  by fund_ticker secid name constituent_ticker detail_holding_type;
  id date;
  var holding;
run;

To my knowledge, they are not compressed. How do I compress a SAS dataset? And does compressing mean making something like a zip or rar file?

 

Kurt_Bremser
Super User

The simple COMPRESS=YES dataset or system option uses a run-length compression that shrinks sequences of identical characters/strings.

You can see the compression state of a dataset in the output of PROC CONTENTS.
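A minimal sketch of both steps, assuming the dataset is WORK.TEST as in the original post (dataset names are illustrative):

```sas
/* Rebuild the dataset with run-length compression enabled */
data test_c (compress=yes);
  set test;
run;

/* The CONTENTS output shows "Compressed" as CHAR (or NO if uncompressed),
   plus the percent reduction in the log when the dataset is written */
proc contents data=test_c;
run;
```

COMPRESS=YES can also be set as a system option (`options compress=yes;`) so that every dataset created afterwards in the session is compressed.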

Kurt_Bremser
Super User

It seems that secid, name and probably constituent_ticker and detail_holding_type are redundant to fund_ticker. If that is the case, you can move them out to a lookup table and/or create formats for them.
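One way to split the redundant columns into a lookup table, assuming fund_ticker uniquely determines secid and name (a sketch against the poster's WORK.TEST, not tested on the actual data):

```sas
/* One row per fund: keep the descriptive columns here only */
proc sort data=test (keep=fund_ticker secid name)
          out=fund_lookup nodupkey;
  by fund_ticker;
run;

/* The large table then carries only the key plus the facts */
data test_slim;
  set test (drop=secid name);
run;
```

When the full attributes are needed again, they can be merged back by fund_ticker, so the 200-million-row table never has to store the repeated text.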

Shmuel
Garnet | Level 18

I see there are many repeated values of FUND_TICKER and of DETAIL_HOLDING_TYPE.

At least those two can be replaced by codes, using formats like the following:

proc format lib=library;
  value fticker
    1 = 'AADR'
    .... add existing/possible values ...
  ;
  value $iticker
    'AADR' = '1'
    ...
  ;
  value htype
    1 = 'EQUITY'
    2 = 'CASH'
    .... add existing/possible values ...
  ;
  value $itype
    'EQUITY' = '1'
    'CASH' = '2'
    ...
  ;
run;

Then replace those variables in your data, converting from character type to numeric:

data new;
  set have(rename=(fund_ticker=v1 Detail_Holding_type=v2));
  fund_ticker = input(put(v1,$iticker.),best3.);
  Detail_Holding_type = input(put(v2,$itype.),best3.);
  drop v1 v2;
run;

Beyond that, if NAME is one-to-one with SECID, you can drop NAME from your dataset

and display it with an appropriate format.

Shmuel
Garnet | Level 18

You compare two datasets:

1) LONG = 3*10^6 rows * (5 columns + 1 date);

    each date holds 6 to 8 bytes depending on compression,

    so the date part takes between 6*3*10^6 and 8*3*10^6 bytes,

    i.e. 18 to 24 MB only out of 1 GB. Most of the size is taken by the first 5 variables.

 

2) WIDE = 5000 rows * (5 columns + 1995 dates);

     the dates probably hold (compressed) 5000*1995*6 bytes =~ 57 MB.

     Thus the space saved is that of the 5 columns * (3*10^6 rows - 5000 rows).

 

Check the lengths of the first 5 columns. I suppose that some of those 5 can be coded as numbers, with a format to display the full value. You may save a lot of space.
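A quick way to inspect those lengths is the DICTIONARY.COLUMNS view (a sketch; assumes the dataset is WORK.TEST):

```sas
/* List name, type and byte length of every column in WORK.TEST */
proc sql;
  select name, type, length
  from dictionary.columns
  where libname='WORK' and memname='TEST';
quit;
```

Long character columns with few distinct values are the best candidates for recoding, since a numeric code takes as little as 3 bytes.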


Shmuel
Garnet | Level 18
For compression, check the current setting with

proc options option=compress;
run;

If needed, add the option COMPRESS=YES.
SASKiwi
PROC Star

@somebody  - If you want to work through a dataset size calculation yourself:  https://v8doc.sas.com/sashtml/cms/zas-size.htm

Current SAS datasets use the V8 calculation.

RichardDeVen
Barite | Level 11

A long (I find tall a better description) data set can be considered a large monolithic rectangle containing one or more categorical columns (identity'ish roles) having repeated values.  A combination of columns having the same values in different rows forms a group.  The categorical columns model either the concept of a 'row' identifier, or a hierarchical organization suitable for BY group processing.  The repeated values are what can make a tall table large.  Suppose you had a categorical variable that is char(40) and an average group size of 100.  That would be 4,000 characters per group.  Pivoting (or transposing) the tall group into a wide row with 100 variables (presuming a model of one wide variable per tall row), there will be only ONE instance of the categorical value, reducing the repetition and resulting in nominally a 100:1 reduction in storage space required for the group's identity variables.

 

Roughly, the larger the storage needed for group/hierarchy identity, the larger the reduction in space will be.  However, switching data from tall to wide may increase complexity and make code in downstream process flows more rigid.

 

Time is money spent with no refunds.  Disk space is cheap, coder time and maintenance is not.


somebody
Lapis Lazuli | Level 10

Thank you, I thought so too. But being a PhD student, disk space is also not cheap 😄

 

