Hello
Let's say I have a SAS data set with 2 million rows and 200 columns.
Let's say I am considering a transpose so there will be a long structure with 3 fields: Custt_ID, var_name, value
(so it will have 400 million rows with 3 columns).
My question: will the resulting data set have the same size (in gigabytes) as the original data set?
Note:
The data set contains information for a credit risk score model.
Each customer has general information as well as values of explanatory variables.
There are different sets of explanatory variables depending on the model the customer passed through.
I want to create a panel that helps users analyze a specific customer.
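For reference, the transpose I have in mind would be something like this (a minimal sketch; WORK.WIDE and WORK.LONG are hypothetical data set names, and I assume the data is already sorted by Custt_ID):
proc transpose data=work.wide
               out=work.long(rename=(_name_=var_name col1=value));
    by custt_id;     /* one BY group per customer */
    var _numeric_;   /* the explanatory variables (Custt_ID itself is character) */
run;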
It is not rocket science to do some simple estimations. You could do it yourself...
/*
Let's assume we have 2 million rows and 201 columns, and
custt_ID is $ 32, and the others are (for simplicity) all numeric.
So a single observation is roughly 32 + 8*200 bytes.
Everything times 2 million:
*/
%let custID=32;
data _null_;
x = 2000000 * (&custid.+8*200) ;
GB = x / 1024**3;
put "1) " GB=;
run;
/*
After transposition, let's assume value stays numeric
but var_name is $ 32 (to cover all possible variable names);
every row turns into 200 rows, so with 2 million:
*/
%let varname=32;
data _null_;
x = 2000000 * 200 * (&custid. + &varname. + 8) ;
GB = x / 1024**3;
put "2) " GB=;
run;
%let varname=8; /* if var_name is shorter, say $ 8 */
data _null_;
x = 2000000 * 200 * (&custid. + &varname. + 8) ;
GB = x / 1024**3;
put "3) " GB=;
run;
LOG:
1) GB=3.0398368835
2) GB=26.822090149
3) GB=17.881393433
Bart
Let's say I am considering a transpose so there will be a long structure with 3 fields: Custt_ID, var_name, value
I don't know the answer to your question about file size, but this arrangement of the data after the transpose will be a nightmare to use in further programming. Furthermore, when we talk about long data sets in SAS, what you describe isn't what we mean (although in non-SAS usage of the word "long" it certainly is long).
You might find formulas that calculate this, but the easiest way is to take a subset and test it yourself.
Compression and indexing on your data will also affect the result.
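For example, something like this (a minimal sketch; WORK.WIDE is a hypothetical name for your source data set, and the FILESIZE column in DICTIONARY.TABLES requires a reasonably recent SAS release):
data work.sample;
    set work.wide(obs=20000);   /* ~1% of 2 million rows */
run;
proc transpose data=work.sample
               out=work.sample_long(rename=(_name_=var_name col1=value));
    by custt_id;
    var _numeric_;
run;
/* compare the physical sizes, then extrapolate by a factor of ~100 */
proc sql;
    select memname, filesize format=comma20.
    from dictionary.tables
    where libname='WORK' and memname in ('SAMPLE','SAMPLE_LONG');
quit;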
I'm usually pro long data set over wide, but this is an extreme key-value-pair model, which is usually not useful for human consumption.
What is the purpose of this re-modelling?
You may be interested in the COMPRESS= option if you are concerned about the size of the file. It would help claw back some of the wasted disk space reserved for blank padding that will inevitably be generated in the conversion if the data set is uncompressed.
You can also estimate the size of your uncompressed table: take the width of your table (set manually, or calculated using SQL as shown in my reply to your other question), multiply by the sum of the sizes of the repeated variables you want to include (I would default to 8 bytes for each numeric variable and 200 for each character variable), then multiply by the number of distinct IDs.
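For example (a minimal sketch; WORK.LONG is a hypothetical name for the transposed data set):
data work.long_c(compress=yes);   /* or compress=binary, better suited to wide numeric rows */
    set work.long;
run;
/* the log then reports by how much the size decreased (or increased) */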
The COMPRESS= option does save space, but it costs extra CPU time.
Regarding setting the length to 3: if the values of every one of the 200 variables wind up in the new data set's value column, you can't use length 3; you will most likely need length 8, unless ALL the original variables fit in that length.
IMHO, estimating the new size of the data set is a waste of time; as I said, the new data set is almost useless in that form.
Hi @Ronein
One technique I've been using to reduce the size of my SAS data sets is setting integer numeric variables to a length of 3 instead of the default 8.
These integer values are typically used to store Boolean flags (0/1, True/False) for modeling purposes. But if you intend to use these integers for aggregations, keeping them at the default length of 8 preserves their precision.
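For example (a minimal sketch with hypothetical flag variable names):
data work.scored;
    length flag_default flag_active 3;   /* 3 bytes each instead of 8 */
    set work.raw_scores;                 /* integers up to 8,192 are still stored exactly at length 3 */
run;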
Check the SAS online help, SAS Help Center: Numeric Precision, for more details.
Hope this helps
If possible, consider deleting observations with missing values.
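In the long layout that can save a lot of space, e.g. (a sketch, assuming the hypothetical WORK.LONG data set from above):
data work.long_nomiss;
    set work.long;
    if not missing(value);   /* keep only rows that carry an actual value */
run;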