Is it more efficient to work with a longer dataset or a wider one?

Hello everyone,

I am working with a giant dataset of 1.5 billion observations and 45 variables. It contains time series data, so one of the variables is a date. Since it takes too much time to run even a simple command like proc means, I am trying to find a way to work with my dataset more efficiently. One way is to change my long dataset to a wider one. In that case I will have about 700 variables and 1 million observations. In general, is it more efficient to have more observations or more variables?

I have also thought about upgrading the hardware. Would an SSD help? Is there any other way to reduce the running time?

I would appreciate any suggestions.


Accepted Solutions
Solution
‎03-30-2015 03:02 PM
Super User
Posts: 11,343

Re: Is it more efficient to work with a longer dataset or a wider one?

When asking about efficiency, you need to consider which kind you mean: execution time, memory or storage space, I/O or network throughput, time to code, or effort to maintain.

One thing that often happens with wide data is that later information makes the data set even "wider", which requires adding variables to the code downstream. With appropriate group identification variables, that usually isn't the case with "long" data: adding a new value to an existing category will usually let existing BY-group processing, and code that creates report columns or rows, keep working without being rewritten.

I would say this goes even further if the "wideness" [if that's an acceptable term] is based on dates. It can be so much easier to select records with something like: date ge '01JAN2014'd than to deal with IF VarJan2014 > 0 and VarFeb2014 > 0 and <so forth>.

Plus, your data can be grouped in multiple ways by applying different formats to variables that exist on each record, without having to list a bunch of variables (a list that may keep growing).
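
A minimal sketch of the long-form selection described above. The data set work.long and its variables id, date, and value are made up for illustration; note that a SAS date literal is written with a trailing d, as in '01JAN2014'd.

```sas
/* Hypothetical long-form data: one row per id and month (Jan 2013 to Dec 2014) */
data work.long;
    format date date9.;
    do id = 1 to 3;
        do m = 0 to 23;
            date  = intnx('month', '01JAN2013'd, m);  /* first day of each month */
            value = id * 10 + m;
            output;
        end;
    end;
    drop m;
run;

/* A single WHERE clause picks up every record from 2014 onward,   */
/* no matter how many months are later appended to the data set.   */
proc means data=work.long n mean;
    where date ge '01JAN2014'd;
    var value;
run;
```

In a wide layout the same selection would need one condition per month variable, and the code would have to change every time a month is added.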
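
As a rough illustration of grouping through formats, the sketch below summarizes the same date variable first by quarter and then by year, only by switching the format. The data set work.daily and its variables are hypothetical.

```sas
/* Hypothetical long-form data: one row per day for two years */
data work.daily;
    format date date9.;
    do date = '01JAN2013'd to '31DEC2014'd;
        value = mod(date, 7);
        output;
    end;
run;

/* CLASS groups on the formatted value, so YYQ6. yields one group per quarter */
proc means data=work.daily n mean;
    class date;
    format date yyq6.;
    var value;
run;

/* Same data, same variable: YEAR4. yields one group per year */
proc means data=work.daily n mean;
    class date;
    format date year4.;
    var value;
run;
```

No new variables are created and no variable lists need maintaining; only the FORMAT statement changes.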

Also look up Normalization in databases for additional information on this topic.


All Replies
Super Contributor
Posts: 298

Re: Is it more efficient to work with a longer dataset or a wider one?

I do not think there is any rule that says one layout is more efficient than the other.

It depends on what you want to do. Sometimes plain DATA step programming, rather than a PROC step, can be helpful. Arrays and hash objects are waiting for you.

Super User
Posts: 7,854

Re: Is it more efficient to work with a longer dataset or a wider one?

As long as the "mass" of data stays the same, you will hit the same I/O bottleneck. It would help more if you can split your data and run your analysis on the subsets.

SSDs may help if your current disk assembly cannot saturate the bandwidth of your I/O subsystem.

If you are truly I/O bound (watch for wait states while your SAS session is running), have you tried experimenting with compression (the compress=yes or compress=binary data set option)?
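
A small sketch of how the two compression settings could be tried side by side. The data set work.big is invented for the example, and using FULLSTIMER to compare real time against CPU time is just one possible way to look for I/O waits, not necessarily what the poster had in mind.

```sas
/* Report real time and CPU time per step; real time far above CPU time */
/* often points to time spent waiting on I/O.                           */
options fullstimer;

/* Hypothetical test data with a mostly empty character column */
data work.big;
    length note $200;
    do id = 1 to 100000;
        note = 'short text';
        x = .;
        y = 0;
        output;
    end;
run;

/* Character (run-length) compression */
data work.big_char (compress=yes);
    set work.big;
run;

/* Binary compression, usually suggested for wide records with many numerics */
data work.big_binary (compress=binary);
    set work.big;
run;
```

The log for each compressed copy notes how much the data set size changed, which gives a quick indication of whether compression is worth the extra CPU work on your data.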

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 65

Re: Is it more efficient to work with a longer dataset or a wider one?

Posted in reply to KurtBremser

Yes, I tried compress.

Super User
Posts: 5,437

Re: Is it more efficient to work with a longer dataset or a wider one?

Usually, wide data sets are not a good idea. But some statistical processing, such as data mining, could require wide data sets.

I suggest that you look at other options first.

  • Subsets: with WHERE on indexed variables
  • Using CLASS in PROC MEANS
  • Adjust MEMSIZE
  • Hardware: SSD (yes), more RAM, more cores
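
The first three suggestions in the list could look roughly like this. The data set work.long, its variables, and the choice of date as the indexed variable are assumptions made for the sketch.

```sas
/* Hypothetical long-form data: one row per id and day */
data work.long;
    format date date9.;
    do id = 1 to 5;
        do date = '01JAN2013'd to '31DEC2014'd;
            value = id + mod(date, 10);
            output;
        end;
    end;
run;

/* Create a simple index on the variable used for subsetting */
proc datasets library=work nolist;
    modify long;
    index create date;
quit;

/* WHERE can use the index; CLASS avoids sorting and BY processing */
proc means data=work.long n mean;
    where date between '01JAN2014'd and '31DEC2014'd;
    class id;
    var value;
run;

/* MEMSIZE can only be set at SAS startup; this just shows the current value */
proc options option=memsize;
run;
```

Whether SAS actually uses the index depends on how selective the WHERE clause is; on a subset that covers most of the table it may still read the data sequentially.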

So what does your PROC MEANS/other programs look like?

Data never sleeps
Contributor
Posts: 65

Re: Is it more efficient to work with a longer dataset or a wider one?

Proc means is just an example. I mainly need to run regressions.

Respected Advisor
Posts: 2,655

Re: Is it more efficient to work with a longer dataset or a wider one?

You are going to end up wanting your data in long form for regression or time series analysis.  Partitioning is almost certainly in your future, or subsampling and model averaging.
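
One way the subsampling idea might be sketched is shown below. The simulated data, the 10% sampling rate, the five replicates, and the plain averaging of coefficients across replicates are all illustrative assumptions, a deliberately simple stand-in for model averaging rather than a recommended procedure. It requires SAS/STAT.

```sas
/* Hypothetical long-form data with a known linear relationship */
data work.long;
    call streaminit(2015);
    do i = 1 to 10000;
        x = rand('normal');
        y = 2 + 3 * x + rand('normal');
        output;
    end;
    drop i;
run;

/* Draw 5 independent 10% simple random samples; output has a Replicate variable */
proc surveyselect data=work.long out=work.sample
                  method=srs samprate=0.1 reps=5 seed=2015;
run;

/* Fit the same regression on each subsample and keep the estimates */
proc reg data=work.sample outest=work.est noprint;
    by replicate;
    model y = x;
run;
quit;

/* Naive averaging of the intercept and slope across the subsamples */
proc means data=work.est mean std;
    var intercept x;
run;
```

Each regression reads only a fraction of the rows, so the approach trades some precision for much less I/O per model fit.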

Steve Denham
