I have a very large dataset (70 GB) with 6 columns and 200 million rows. I tested a subset of this data with about 3 million rows, which takes about 1 GB. I transposed this long dataset to a wide dataset with about 5,000 rows and 2,000 columns (the columns are dates, except the first 5, which are other variables). This reduces the size significantly, to 65 MB.
So my questions are: Why did the size change so much when I transposed from long to wide? How does SAS store data? What memory/size does it assign to rows vs. columns? Are there any tricks for dealing with such large datasets?
Thanks
Which other variables (aside from the dates and the "by" variables) does your long dataset contain?
Is your original dataset uncompressed, while your transposed dataset is compressed?
Please see attached photo for a sample of my LONG dataset.
I use the following code to transpose the data:
proc transpose data=test out=test2;
    by fund_ticker secid name constituent_ticker detail_holding_type;
    id date;
    var holding;
run;
To my knowledge, they are not compressed. How do I compress a SAS dataset? And does compressing mean making something like a zip or rar file?
The simple COMPRESS=YES dataset or system option uses run-length encoding, which compresses sequences of identical characters/strings. It is not a zip/rar archive; the dataset remains directly usable by SAS.
You can see the compression state of a dataset in the output of PROC CONTENTS.
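For illustration, a minimal sketch of both ways to turn on compression and then check the result (dataset names here are assumptions based on the code posted above):

```sas
/* Option 1: system option - applies to all datasets created afterwards */
options compress=yes;

/* Option 2: per-dataset option - rewrite the existing dataset compressed */
data work.test_c (compress=yes);
    set work.test;
run;

/* PROC CONTENTS reports "Compressed: YES" and the percent size reduction */
proc contents data=work.test_c;
run;
```

Note that compression only pays off when values contain repeated characters; for short numeric-heavy rows SAS may even warn that the compressed file is larger.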
It seems that SECID, NAME, and probably CONSTITUENT_TICKER and DETAIL_HOLDING_TYPE are redundant with FUND_TICKER. If that is the case, you can move them out to a lookup table and/or create formats for them.
I see there are many repeated values of FUND_TICKER and of DETAIL_HOLDING_TYPE.
At least those two can be replaced by codes, using formats like these:
proc format lib=library;
    value fticker
        1 = 'AADR'
        .... add existing/possible values ...
    ;
    value $iticker
        'AADR' = '1'
        ...
    ;
    value htype
        1 = 'EQUITY'
        2 = 'CASH'
        .... add existing/possible values ...
    ;
    value $itype
        'EQUITY' = '1'
        'CASH' = '2'
        ...
    ;
run;
Then replace those variables in your data, converting from character type to numeric:
data new;
    set have (rename=(fund_ticker=v1 detail_holding_type=v2));
    fund_ticker = input(put(v1, $iticker.), best3.);
    detail_holding_type = input(put(v2, $itype.), best3.);
    drop v1 v2;
run;
Beyond that, if NAME is one-to-one with SECID, you can drop NAME from your dataset
and display it with an appropriate format.
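A hypothetical sketch of that idea: build a SECID-to-NAME format from the data itself with CNTLIN=, so NAME can be dropped from the table but still displayed on demand. The dataset and format names below are illustrative, and the code assumes SECID is character and the data is sorted by SECID (use a numeric format name without the `$` if SECID is numeric):

```sas
/* One row per SECID, shaped as a CNTLIN dataset (FMTNAME/START/LABEL) */
data namefmt;
    set have (keep=secid name);
    by secid;                 /* assumes HAVE is sorted by SECID */
    if first.secid;
    retain fmtname '$secname';
    rename secid=start name=label;
run;

proc format cntlin=namefmt;
run;

/* Later, display NAME without storing it in the dataset */
proc print data=new;
    format secid $secname.;
run;
```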
You are comparing two datasets:
1) LONG = 3*10^6 rows * (5 columns + 1 date column);
each date value takes 6 to 8 bytes depending on compression,
so the date part accounts for 6*3*10^6 to 8*3*10^6 bytes,
i.e. only 18 to 24 MB out of 1 GB. Most of the size comes from the first 5 variables.
2) WIDE = 5000 rows * (5 columns + 1995 date columns);
the date values probably take (compressed) 5000*1995*6 bytes =~ 57 MB.
Thus the space saved is the storage of those 5 columns over (3*10^6 - 5000) rows.
Check the lengths of the first 5 columns. I suspect some of those 5 can be coded as numbers, with a format to display the full value. You may save a lot of space.
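To act on that suggestion, you can inspect the declared lengths and then shorten any character variables that are wider than their longest value. A minimal sketch (dataset name and lengths are assumptions; pick lengths from what PROC CONTENTS and your data actually show, as values longer than the new length will be truncated):

```sas
/* List variables with their types and lengths, in creation order */
proc contents data=test varnum;
run;

/* Rewrite with tighter character lengths - LENGTH must come before SET */
data test_small;
    length fund_ticker $8 detail_holding_type $6;  /* illustrative lengths */
    set test;
run;
```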
@somebody - If you want to work through a dataset size calculation yourself: https://v8doc.sas.com/sashtml/cms/zas-size.htm
Current SAS datasets use the V8 calculation.
A long data set (I find "tall" a better description) can be considered a large monolithic rectangle containing one or more categorical columns (identity-like roles) whose values repeat. A combination of columns having the same values across rows forms a group. The categorical columns model either a row identifier or a hierarchical organization suitable for BY-group processing. Those repeated values are what can make a tall table large. Suppose you had a categorical variable that is char(40) and an average group size of 100 rows: that is 4,000 characters of identity data per group. Pivoting (transposing) each tall group into one wide row with 100 variables (presuming one wide variable per tall row) leaves only ONE instance of the categorical value, removing the repetition and yielding nominally a 100:1 reduction in the storage required for the group identity variables.
Roughly, the more storage the group/hierarchy identity needs, the larger the reduction in space will be. However, switching from tall to wide may increase complexity and make code in downstream process flows harder to maintain.
Time is money spent with no refunds. Disk space is cheap, coder time and maintenance is not.
Thank you, I thought so too. But being a PhD student, disk space is also not cheap 😄