Hello everyone,
I am working with a giant dataset: 1.5 billion observations and 45 variables. It contains time series data, so one of the variables is a date. Since it takes too much time to run even a simple command, like PROC MEANS, I am trying to find a way to work with my dataset more efficiently. One way is to change my long dataset to a wider one. In that case I would have about 700 variables and 1 million observations. In general, is it more efficient to have more observations or more variables?
I also thought about improving the hardware. Would an SSD be helpful? Is there any other way I can decrease the running time?
I appreciate any suggestion.
When asking about efficiency you need to consider whether you are addressing execution time, memory/storage space, I/O and network throughput, time to code, and/or effort to maintain.
One thing that often occurs with wide data is that later information makes the data set "wider", requiring additional variables in downstream code. With appropriate group identification variables that usually isn't the case with "long" data: adding a new value to an existing category will usually work with existing BY-group processing, or add columns or rows to reports, without rewriting existing code.
I would say this goes even further if the "wideness" (if that's an acceptable term) is based on dates. It is much easier to select records with something like date ge '01JAN2014'd than to deal with IF VarJan2014 > 0 and VarFeb2014 > 0 and <so forth>.
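A minimal sketch of that kind of date-based selection on the long layout (the data set WORK.LONG and the variables ID, DATE, and VALUE are hypothetical names, not from your post):

   proc means data=work.long mean sum;
      where date ge '01JAN2014'd;   /* one WHERE clause covers any date range */
      class id;
      var value;
   run;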
Plus your data can be grouped multiple ways by applying different formats to variables that exist on every record, without having to list a bunch of (possibly ever-growing) variables.
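As a sketch of that idea (again with hypothetical names), a format can regroup the same DATE variable into quarters without adding or listing any variables:

   proc format;
      value qtrfmt
         '01JAN2014'd - '31MAR2014'd = 'Q1 2014'
         '01APR2014'd - '30JUN2014'd = 'Q2 2014';
   run;

   proc means data=work.long sum;
      class date;
      format date qtrfmt.;   /* CLASS groups by the formatted value */
      var value;
   run;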
Also look up Normalization in databases for additional information on this topic.
I do not think there is any rule that says that one way is more efficient than the other.
It depends on what you want to do. Sometimes simple DATA step programming rather than a PROC step might be helpful. Arrays and hash objects are waiting for you.
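For instance, a rough sketch of a one-pass lookup with a hash object instead of a sort/merge (WORK.LOOKUP, WORK.BIG, ID, and GROUP are hypothetical names):

   data want;
      if _n_ = 1 then do;
         if 0 then set work.lookup;             /* define lookup variables in the PDV */
         declare hash h(dataset:'work.lookup');
         h.defineKey('id');
         h.defineData('group');
         h.defineDone();
      end;
      set work.big;
      if h.find() = 0;                          /* keep only rows found in the lookup */
   run;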
As long as the "mass" of data stays the same, you will experience the same I/O bottleneck. It would be more helpful if you could partition your data and run your analysis on the subsets.
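Something like this (a sketch, with hypothetical names) splits the table into yearly partitions in a single pass:

   data year2013 year2014 year2015;
      set work.big;
      select (year(date));
         when (2013) output year2013;
         when (2014) output year2014;
         when (2015) output year2015;
         otherwise;
      end;
   run;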
SSDs may help if your current disk array cannot saturate the bandwidth of your I/O subsystem.
If you are truly I/O bound (watch for wait states while your SAS session is running), have you tried experimenting with compression (the COMPRESS=YES or COMPRESS=BINARY data set option)?
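For reference, a minimal sketch of applying the option when the table is rewritten (the data set names are hypothetical):

   data work.big_c (compress=binary);
      set work.big;
   run;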
Yes, I tried compress.
Usually, wide data sets are not a good idea. But for some statistical processing, like data mining cases, wide data sets could be required.
I suggest that you look at other options first.
So what do your PROC MEANS and other programs look like?
Proc means is just an example. I mainly need to run regression.
You are going to end up wanting your data in long form for regression or time series analysis. Partitioning is almost certainly in your future, or subsampling and model averaging.
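As a rough sketch of the partitioning route (assuming a hypothetical partitioning variable YEAR and hypothetical model variables Y, X1, X2), BY-group processing fits the model within each partition:

   proc sort data=work.big out=work.big_sorted;
      by year;
   run;

   proc reg data=work.big_sorted outest=work.estimates;
      by year;
      model y = x1 x2;
   run;
   quit;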
Steve Denham