somebody
Lapis Lazuli | Level 10

I have a very large dataset (70 GB) that has 6 columns and 200 million rows. I tested a subset of this data with about 3 million rows; this takes about 1 GB. I transposed this long dataset to a wide dataset that has about 5000 rows and 2000 columns (the columns are dates, except the first 5 columns, which are other variables). This reduces the size significantly, to 65 MB.

So my questions are: Why did the size change so much when I transposed from long to wide? How does SAS store data? What memory/size does it assign to a row vs. a column? Are there tricks for dealing with such large datasets?

Thanks

 

10 REPLIES
Kurt_Bremser
Super User

Which other variables (aside from the dates and the "by" variables) does your long dataset contain?

Is your original dataset uncompressed, while your transposed dataset is compressed?

somebody
Lapis Lazuli | Level 10

Please see attached photo for a sample of my LONG dataset. 

somebody_0-1596867415027.png

I use the following code to transpose the data:

proc transpose data=test out=test2;
  by fund_ticker secid name constituent_ticker detail_holding_type;
  id date;
  var holding;
run;

To my knowledge, they are not compressed. How do I compress a SAS dataset? And does compressing mean making something like a zip or rar file?

 

Kurt_Bremser
Super User

The simple COMPRESS=YES dataset or system option uses a run-length compression that shrinks sequences of identical characters/strings.

You can see the compression state of a dataset in the output of PROC CONTENTS.
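A minimal sketch of both steps, assuming the dataset is WORK.TEST as in the original post (dataset names are illustrative):

```sas
/* Rebuild the dataset with run-length compression enabled */
data test_c (compress=yes);
  set test;
run;

/* The CONTENTS output shows "Compressed" as CHAR (or NO if uncompressed),
   plus the percent reduction in the log when the dataset is written */
proc contents data=test_c;
run;
```

COMPRESS=YES can also be set as a system option (`options compress=yes;`) so that every dataset created afterwards in the session is compressed.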

Kurt_Bremser
Super User

It seems that secid, name and probably constituent_ticker and detail_holding_type are redundant to fund_ticker. If that is the case, you can move them out to a lookup table and/or create formats for them.
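One way to split the redundant columns into a lookup table, assuming fund_ticker uniquely determines secid and name (a sketch against the poster's WORK.TEST, not tested on the actual data):

```sas
/* One row per fund: keep the descriptive columns here only */
proc sort data=test (keep=fund_ticker secid name)
          out=fund_lookup nodupkey;
  by fund_ticker;
run;

/* The large table then carries only the key plus the facts */
data test_slim;
  set test (drop=secid name);
run;
```

When the full attributes are needed again, they can be merged back by fund_ticker, so the 200-million-row table never has to store the repeated text.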

Shmuel
Garnet | Level 18

I see there are many repeated values of FUND_TICKER and of DETAIL_HOLDING_TYPE.

At least those two can be replaced by codes, using formats like the following:

proc format lib=library;
  value fticker
    1 = 'AADR'
    .... add existing/possible values ...
  ;
  value $iticker
    'AADR' = '1'
    ...
  ;
  value htype
    1 = 'EQUITY'
    2 = 'CASH'
    .... add existing/possible values ...
  ;
  value $itype
    'EQUITY' = '1'
    'CASH' = '2'
    ...
  ;
run;

Then replace those variables in your data, converting from character type to numeric:

data new;
  set have(rename=(fund_ticker=v1 Detail_Holding_type=v2));
  fund_ticker = input(put(v1,$iticker.),best3.);
  Detail_Holding_type = input(put(v2,$itype.),best3.);
  drop v1 v2;
run;

Beyond that, if NAME is one-to-one with SECID, you can drop NAME from your dataset

and display it with an appropriate format.

Shmuel
Garnet | Level 18

You compare two datasets:

1) LONG = 3*10^6 rows * (5 columns + 1 date);

    each date holds 6 to 8 bytes depending on compression,

    so the date part takes between 6*3*10^6 and 8*3*10^6 bytes,

    i.e. 18 to 24 MB only out of 1 GB. Most of the size is taken by the first 5 variables.

 

2) WIDE = 5000 rows * (5 columns + 1995 dates);

     the dates probably hold (compressed) 5000*1995*6 bytes =~ 57 MB.

     Thus the space saved is that of the 5 columns * (3*10^6 rows - 5000 rows).

 

Check the lengths of the first 5 columns. I suppose that some of those 5 can be coded as numbers, with a format to display the full value. You may save a lot of space.
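A quick way to inspect those lengths is the DICTIONARY.COLUMNS view (a sketch; assumes the dataset is WORK.TEST):

```sas
/* List name, type and byte length of every column in WORK.TEST */
proc sql;
  select name, type, length
  from dictionary.columns
  where libname='WORK' and memname='TEST';
quit;
```

Long character columns with few distinct values are the best candidates for recoding, since a numeric code takes as little as 3 bytes.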


Shmuel
Garnet | Level 18
For compression, check the current setting with

proc options option=compress;
run;

If needed, add the option COMPRESS=YES.
SASKiwi
PROC Star

@somebody  - If you want to work through a dataset size calculation yourself:  https://v8doc.sas.com/sashtml/cms/zas-size.htm

Current SAS datasets use the V8 calculation.

RichardDeVen
Barite | Level 11

A long (I find tall a better description) data set can be considered a large monolithic rectangle containing one or more categorical columns (identity'ish roles) having repeated values.  A combination of columns having the same values in different rows forms a group.  The categorical columns model either the concept of a 'row' identifier, or a hierarchical organization suitable for BY group processing.  The repeated values are what can make a tall table large.  Suppose you had a categorical variable that is char(40) and an average group size of 100.  That would be 4,000 characters per group.  Pivoting (or transposing) the tall group into a wide row with 100 variables (presuming a model of one wide variable per tall row), there will be only ONE instance of the categorical value, reducing the repetition and resulting in nominally a 100:1 reduction in storage space required for the group's identity variables.

 

Roughly, the larger the storage needed for group/hierarchy identity, the larger the reduction in space will be.  However, switching data from tall to wide may increase complexity and make code in downstream process flows more rigid.

 

Time is money spent with no refunds.  Disk space is cheap, coder time and maintenance is not.


somebody
Lapis Lazuli | Level 10

Thank you, I thought so too. But being a PhD student, disk space is also not cheap 😄

 

