Hello everyone,
I am working with a giant dataset: 1.5 billion observations and 45 variables. It contains time series data, so one of the variables is a date. Since it takes too much time to run even a simple command, like PROC MEANS, I am trying to find a way to work with my dataset more efficiently. One way is to change my long dataset to a wider one. In that case I would have about 700 variables and 1 million observations. In general, is it more efficient to have more observations or more variables?
I also thought about improving the hardware. Would an SSD be helpful? Is there any other way I can decrease the running time?
I appreciate any suggestion.
When asking about efficiency you need to consider whether you are addressing execution time, memory/storage space, I/O and network throughput, time to code, and/or effort to maintain.
One thing that often occurs with wide data is that later information makes the data set "wider", requiring additional variables in downstream code. With appropriate group identification variables that usually isn't the case with "long" data: adding a new value to an existing category will usually work with existing BY-group processing, or add columns or rows to reports, without rewriting existing code.
I would say this goes even further if the "wideness" (if that's an acceptable term) is based on dates. It is much easier to select records with something like date ge '01JAN2014'd than to deal with IF VarJan2014 > 0 and VarFeb2014 > 0 and <so forth>.
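A minimal sketch of that kind of date-based selection on the long layout (the data set WORK.LONG and the variables ID, DATE, and VALUE are hypothetical names, not from your post):

   proc means data=work.long mean sum;
      where date ge '01JAN2014'd;   /* one WHERE clause covers any date range */
      class id;
      var value;
   run;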
Plus your data can be grouped multiple ways by applying different formats to variables that exist on every record, without having to list a bunch of (possibly ever-growing) variables.
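As a sketch of that idea (again with hypothetical names), a format can regroup the same DATE variable into quarters without adding or listing any variables:

   proc format;
      value qtrfmt
         '01JAN2014'd - '31MAR2014'd = 'Q1 2014'
         '01APR2014'd - '30JUN2014'd = 'Q2 2014';
   run;

   proc means data=work.long sum;
      class date;
      format date qtrfmt.;   /* CLASS groups by the formatted value */
      var value;
   run;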
Also look up Normalization in databases for additional information on this topic.
I do not think there is any rule that says that one way is more efficient than the other.
It depends on what you want to do. Sometimes simple DATA step programming rather than a PROC step might be helpful. Arrays and hash objects are waiting for you.
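For instance, a rough sketch of a one-pass lookup with a hash object instead of a sort/merge (WORK.LOOKUP, WORK.BIG, ID, and GROUP are hypothetical names):

   data want;
      if _n_ = 1 then do;
         if 0 then set work.lookup;             /* define lookup variables in the PDV */
         declare hash h(dataset:'work.lookup');
         h.defineKey('id');
         h.defineData('group');
         h.defineDone();
      end;
      set work.big;
      if h.find() = 0;                          /* keep only rows found in the lookup */
   run;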
As long as the "mass" of data stays the same, you will experience the same I/O bottleneck. It would be more helpful if you could partition your data and run your analysis on the subsets.
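Something like this (a sketch, with hypothetical names) splits the table into yearly partitions in a single pass:

   data year2013 year2014 year2015;
      set work.big;
      select (year(date));
         when (2013) output year2013;
         when (2014) output year2014;
         when (2015) output year2015;
         otherwise;
      end;
   run;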
SSDs may help if your current disk array cannot saturate the bandwidth of your I/O subsystem.
If you are truly I/O bound (watch for wait states while your SAS session is running), have you tried experimenting with compression (the COMPRESS=YES or COMPRESS=BINARY data set option)?
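For reference, a minimal sketch of applying the option when the table is rewritten (the data set names are hypothetical):

   data work.big_c (compress=binary);
      set work.big;
   run;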
Yes, I tried compress.
Usually, wide data sets are not a good idea. But for some statistical processing, like data mining cases, wide data sets could be required.
I suggest that you look at other options first.
So what do your PROC MEANS and other programs look like?
Proc means is just an example. I mainly need to run regression.
You are going to end up wanting your data in long form for regression or time series analysis. Partitioning is almost certainly in your future, or subsampling and model averaging.
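As a rough sketch of the partitioning route (assuming a hypothetical partitioning variable YEAR and hypothetical model variables Y, X1, X2), BY-group processing fits the model within each partition:

   proc sort data=work.big out=work.big_sorted;
      by year;
   run;

   proc reg data=work.big_sorted outest=work.estimates;
      by year;
      model y = x1 x2;
   run;
   quit;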
Steve Denham