m1986MM
Obsidian | Level 7

Hello everyone,

I am working with a giant dataset: 1.5 billion observations and 45 variables. It contains time series data, so one of the variables is a date. Since it takes too much time to run even a simple step like PROC MEANS, I am trying to find a way to work with my dataset more efficiently. One option is to reshape my long dataset into a wider one; in that case I would have about 700 variables and 1 million observations. In general, is it more efficient to have more observations or more variables?

I have also thought about upgrading the hardware. Would an SSD be helpful? Is there anything else I can do to decrease the running time?

I appreciate any suggestion.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

When asking about efficiency, you need to consider what you are addressing: execution time, memory/storage space, I/O and network throughput, time to code, and/or effort to maintain.

One thing that often happens with wide data is that later information makes the data set even "wider", requiring additional variables to be added in downstream code. With appropriate group-identification variables that usually isn't the case with "long" data: adding a new value to an existing category usually lets existing BY-group processing, or code that builds report columns or rows, keep working without a rewrite.

I would say this goes even further if the "wideness" (if that's an acceptable term) is based on dates. It is so much easier to select records with something like WHERE date ge '01JAN2014'd than to deal with IF VarJan2014 > 0 and VarFeb2014 > 0 and <so forth>.
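A minimal sketch of that contrast, using hypothetical table and variable names (WORK.LONG_SALES with DATE and AMOUNT; WORK.WIDE_SALES with one AMT_mmmyyyy column per month):

/* Long layout: one date comparison selects the records,
   and it keeps working as new months arrive. */
data recent;
   set work.long_sales;
   where date ge '01JAN2014'd;
run;

/* Wide layout: the same selection needs every monthly column
   spelled out, and each new month means editing the code. */
data recent_wide;
   set work.wide_sales;
   if amt_jan2014 > 0 and amt_feb2014 > 0 and amt_mar2014 > 0;
run;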

Plus, your data can be grouped multiple ways just by applying different formats to variables that exist on every record, without having to list a bunch of (possibly ever-growing) variables.
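For example (a sketch with the same hypothetical names; AMOUNT is the analysis variable), the same DATE variable can be summarized by year or by quarter just by switching the format:

proc means data=work.long_sales mean;
   class date;
   format date year4.;     /* yearly means */
   var amount;
run;

proc means data=work.long_sales mean;
   class date;
   format date yyq6.;      /* quarterly means, e.g. 2014Q1 */
   var amount;
run;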

Also look up Normalization in databases for additional information on this topic.


7 REPLIES
KachiM
Rhodochrosite | Level 12

I do not think there is any rule that says that one way is more efficient than the other.

It depends on what you want to do. Sometimes simple DATA step programming rather than a PROC step can be helpful. Arrays and hash objects are waiting for you.
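For instance, a hash object lets you merge a small lookup table onto the big table in a single sequential pass, with no sort of the 1.5-billion-row table. A minimal sketch, with hypothetical data set and variable names (WORK.LOOKUP with ID and LABEL, WORK.BIG with ID):

data joined;
   if _n_ = 1 then do;
      if 0 then set work.lookup;               /* define ID and LABEL in the PDV */
      declare hash h(dataset: 'work.lookup');  /* load the small table into memory */
      h.defineKey('id');
      h.defineData('label');
      h.defineDone();
   end;
   set work.big;                               /* one pass over the large table */
   if h.find() = 0;                            /* keep only rows with a matching ID */
run;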

Kurt_Bremser
Super User

As long as the "mass" of data stays the same, you will experience the same I/O bottleneck. It would be more helpful if you could split your data and run your analyses on the subsets.
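For example (a sketch with hypothetical names), a WHERE= data set option reads only one year's slice, so far less data has to come off disk for each run:

proc means data=work.big(where=('01JAN2014'd le date le '31DEC2014'd)) n mean std;
   var amount;
run;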

SSDs may help if your current disk assembly cannot saturate the bandwidth of your I/O subsystem.

If you are truly I/O bound (watch for wait states while your SAS session is running), have you tried experimenting with compression (the COMPRESS=YES or COMPRESS=BINARY data set option)?
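A minimal sketch of turning compression on when rebuilding the table (names are hypothetical); whether YES (RLE, good for character-heavy rows) or BINARY (often better for mostly numeric rows with many variables) wins depends on your data, so check the compression note in the log:

data work.big_compressed(compress=binary);
   set work.big;
run;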

LinusH
Tourmaline | Level 20

Usually, wide data sets are not a good idea, but for some statistical processing, such as data mining cases, a wide data set could be required.

I suggest that you look at other options first (a sketch of the first two follows the list):

  • Subsets: with WHERE on indexed variables
  • Using CLASS in PROC MEANS
  • Adjust MEMSIZE
  • HW: SSD - yes, more RAM, more cores
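
A minimal sketch of the first two bullets, with hypothetical table and variable names (MEMSIZE is a startup option, set in the config file or on the command line, so it is not shown here):

proc datasets library=work nolist;
   modify big;
   index create date;           /* build a simple index on the date variable */
quit;

proc means data=work.big n mean std;
   where date ge '01JAN2014'd;  /* WHERE on the indexed variable */
   class product_id;            /* statistics per group, no reshaping needed */
   var amount;
run;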

So what do your PROC MEANS and other programs look like?

Data never sleeps
m1986MM
Obsidian | Level 7

PROC MEANS is just an example. I mainly need to run regressions.

SteveDenham
Jade | Level 19

You are going to end up wanting your data in long form for regression or time series analysis.  Partitioning is almost certainly in your future, or subsampling and model averaging.
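For instance (a sketch with hypothetical variable names), BY-group processing fits the same model separately within each partition, provided the data are sorted or indexed by the grouping variable:

proc reg data=work.big outest=work.estimates noprint;
   by product_id;
   model amount = price lag_amount;
run;
quit;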

Steve Denham



Discussion stats
  • 7 replies
  • 1434 views
  • 6 likes
  • 6 in conversation