Hello everyone,
I am working with a giant dataset: 1.5 billion observations and 45 variables. It contains time series data, so one of the variables is a date. Since it takes too much time to run even a simple step like PROC MEANS, I am trying to find a way to work with my dataset more efficiently. One idea is to reshape my long dataset into a wide one; in that case I would have about 700 variables and 1 million observations. In general, is it more efficient to have more observations or more variables?
I have also thought about upgrading the hardware. Would an SSD help? Is there any other way to decrease the running time?
I appreciate any suggestions.
Accepted Solutions
When asking about efficiency, you need to consider what you are trying to improve: execution time, memory/storage space, I/O and network throughput, time to code, and/or effort to maintain.
One thing that often happens with wide data is that later information makes the data set wider still, which means adding variables to downstream code. With appropriate group-identification variables that usually isn't the case with long data: adding a new value to an existing category will usually let existing BY-group processing, or code that builds report columns or rows, keep working without being rewritten.
I would say this goes even further when the "wideness" (if that's an acceptable term) is based on dates. It is much easier to select records with something like WHERE date ge '01JAN2014'd than to deal with IF VarJan2014 > 0 and VarFeb2014 > 0 and <so forth>.
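A minimal sketch of that point, assuming a long data set WORK.LONG with variables DATE and VALUE and a wide counterpart WORK.WIDE with one variable per month (all names invented for illustration):

    /* Long layout: one WHERE clause covers any date range */
    proc means data=work.long n mean;
       where date ge '01JAN2014'd;   /* note the trailing d on the date literal */
       var value;
    run;

    /* Wide layout: the VAR list must be edited every time a new month arrives */
    proc means data=work.wide n mean;
       var value_jan2014 value_feb2014 /* ...one variable per month... */;
    run;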
Plus, your data can be grouped in multiple ways just by applying different formats to variables that exist on every record, without having to list a bunch of (possibly ever-growing) variables.
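As a hedged sketch of that idea with the same hypothetical WORK.LONG, standard SAS date formats make PROC MEANS group the CLASS variable by its formatted value:

    /* Group by calendar quarter... */
    proc means data=work.long n mean;
       class date;
       format date yyq6.;
       var value;
    run;

    /* ...or by year, with no restructuring of the data at all */
    proc means data=work.long n mean;
       class date;
       format date year4.;
       var value;
    run;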
Also look up Normalization in databases for additional information on this topic.
I do not think there is any rule that says one layout is more efficient than the other.
It depends on what you want to do. Sometimes simple DATA step programming rather than a PROC step can help; arrays and hash objects are waiting for you.
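As one illustration of the hash-object route (a sketch only, assuming a long data set WORK.LONG with a group key ID and an analysis variable VALUE, both invented names), group totals and counts can be collected in a single pass:

    data _null_;
       if _n_ = 1 then do;
          declare hash h();          /* one entry per distinct ID */
          h.defineKey('id');
          h.defineData('id', 'total', 'n');
          h.defineDone();
          call missing(total, n);
       end;
       set work.long end=last;
       if h.find() ne 0 then do;     /* first time this ID is seen */
          total = value;
          n     = 1;
       end;
       else do;                      /* running sum and count */
          total = total + value;
          n     = n + 1;
       end;
       rc = h.replace();
       if last then rc = h.output(dataset:'work.group_stats');  /* mean = total/n */
    run;

Whether this beats PROC MEANS depends on how much of the work is I/O versus computation, so it is worth timing both on a subset first.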
As long as the "mass" of the data stays the same, you will experience the same I/O bottleneck. It would be more helpful if you could partition your data and run your analysis on the subsets.
SSDs may help if your current disk setup cannot saturate the bandwidth of your I/O subsystem.
If you are truly I/O bound (watch for wait states while your SAS session is running), have you tried experimenting with compression (the COMPRESS=YES or COMPRESS=BINARY data set option)?
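A small sketch of the compression options, using a hypothetical WORK.BIG as the source:

    options compress=binary;                 /* session-wide default for new data sets */

    data work.big_c (compress=binary);       /* or as a data set option */
       set work.big;
    run;                                     /* the log reports the compression ratio */

BINARY tends to pay off on wide, mostly numeric observations, while YES/CHAR suits character-heavy data; with only 45 variables per row the gain may be modest, so check the ratio reported in the log before committing.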
Yes, I tried compress.
Usually, wide data sets are not a good idea, but for some statistical processing, such as data mining, a wide data set could be required.
I suggest you look at other options first (the first two are sketched after the list):
- Subsets: with WHERE on indexed variables
- Using CLASS in PROC MEANS
- Adjust MEMSIZE
- HW: SSD - yes, more RAM, more cores
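A hedged sketch of the first two bullets, again assuming a long data set WORK.LONG with DATE, ID and VALUE (invented names); MEMSIZE itself is set at SAS invocation (e.g. the -MEMSIZE option on the command line or in the config file), not inside the program:

    /* Index DATE so WHERE subsetting does not have to scan all 1.5 billion rows */
    proc datasets library=work nolist;
       modify long;
       index create date;
    quit;

    /* Subset with WHERE on the indexed variable, summarize by group with CLASS */
    proc means data=work.long n mean;
       where date between '01JAN2014'd and '31DEC2014'd;
       class id;
       var value;
    run;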
So what do your PROC MEANS and other programs look like?
PROC MEANS is just an example. I mainly need to run regressions.
You are going to end up wanting your data in long form for regression or time series analysis. Partitioning is almost certainly in your future, or subsampling and model averaging.
Steve Denham
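If subsampling turns out to be the route, a minimal sketch (the variable names Y, X1, X2 and the 1% rate are purely illustrative) would be to draw a simple random sample and fit the model on that first:

    /* Draw a 1% simple random sample of the long data set */
    proc surveyselect data=work.long out=work.sample
                      method=srs samprate=0.01 seed=12345;
    run;

    /* Fit the regression on the sample before scaling up or averaging across subsamples */
    proc reg data=work.sample;
       model y = x1 x2;
    run;
    quit;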