02-26-2013 05:00 AM
You might want to examine the concepts documentation carefully and use the correct terminology.
In my world, no data manipulation (regardless of software) takes place on disk alone - you need the CPU to do the manipulation on pieces of data that reside in memory (aka RAM).
In your example, the data manipulation part is x=1;
Then the result is saved in the file (or table) tmp - which is an I/O operation, not data manipulation.
So, like https://communities.sas.com/people/Astounding, I wonder what you are really looking for.
Hash techniques let you manipulate, look up, and sort data, keeping it explicitly in RAM before storing it to disk.
Regular data step processing will mainly let you have the current row in memory (logically).
For lookups, you can use formats (which are loaded into memory during processing).
For sorting, aggregation and other statistical calculations, SQL and SAS PROCs will do a lot of the work with data loaded into the memory.
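To illustrate the hash-object lookup idea, here is a minimal sketch. The table names (work.securities, work.rates) and variables (secid, rate, notional) are hypothetical placeholders, not from the original post:

```sas
/* Hypothetical example: in-memory lookup with a hash object.
   Assumes work.rates has key secid and data variable rate,
   and work.securities has secid and notional. */
data priced;
    if _n_ = 1 then do;
        if 0 then set work.rates;            /* define rate in the PDV */
        declare hash h(dataset: 'work.rates'); /* lookup table loaded into RAM once */
        h.defineKey('secid');
        h.defineData('rate');
        h.defineDone();
    end;
    set work.securities;
    if h.find() = 0 then price = notional * rate; /* lookup happens in memory */
run;
```

The whole lookup table stays in RAM for the duration of the step, so no extra disk reads are needed per row.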
02-26-2013 06:32 AM
Another option that I came to think of is the SASFILE statement, which lets you explicitly read a table into memory for use by multiple other steps during a SAS session.
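A quick sketch of how SASFILE is used (work.big and the variable price are hypothetical names for illustration):

```sas
/* Load the table into memory once; subsequent steps read it from RAM. */
sasfile work.big load;

proc means data=work.big mean std;
    var price;
run;

proc sort data=work.big out=work.big_sorted;
    by secid;
run;

sasfile work.big close;   /* release the memory when done */
```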
02-26-2013 06:49 PM
I usually have a data set with over 50,000 securities and run 10,000 simulations to value the price changes. Then I need quick statistics like mean, std, skewness, etc. based on the generated data set.
Given the well over 500 million rows, doing the calculation in memory is apparently the only option for fast processing.
Can you help think of any memory techniques with minimal I/O operations?
02-26-2013 08:03 PM
JMP does everything in memory, though we don't have any machines with that much memory...
You might consider a hardware solution: getting a solid-state drive. It's not as fast as RAM, but it is a lot faster than a spinning disk.
BTW, you don't need 10,000 simulations to get good estimates of those parameters; a few hundred will suffice.
02-26-2013 02:20 PM
One way to do in-memory DATA step processing is to load your input data into a temporary array (ARRAY X(2,2) _TEMPORARY_) - for example, an input table of 2 columns and 2 rows becomes a 2 by 2 temporary array. You can then process the array using DO loops.
Once processing is finished you then "unpack" the temporary array back into columns and rows to write it out.
I have used this technique for resource-intensive optimisation processes and it is pretty fast, but you are limited by memory and maximum array size.
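Here is a minimal sketch of the pack/process/unpack pattern described above, assuming a hypothetical 2-row input table work.in with columns col1 and col2:

```sas
/* Sketch: load a small table into a temporary array, process in RAM, unpack. */
data out;
    array x(2,2) _temporary_;            /* 2 rows by 2 columns, held in memory */
    /* pack: read the input table into the array */
    do i = 1 to 2;
        set work.in;                     /* assumes columns col1, col2 */
        x(i,1) = col1;
        x(i,2) = col2;
    end;
    /* process: e.g. double every cell using DO loops */
    do i = 1 to 2;
        do j = 1 to 2;
            x(i,j) = 2 * x(i,j);
        end;
    end;
    /* unpack: write the array back out as rows */
    do i = 1 to 2;
        col1 = x(i,1);
        col2 = x(i,2);
        output;
    end;
    stop;
run;
```

The processing in the middle loop is where the real work would go; everything between pack and unpack happens entirely in memory.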
There are some interesting developments in SAS 9.3 and use of PROC FCMP and more to come in 9.4 that may help in this area. Search SAS Support for more details.
02-27-2013 01:01 PM
10k simulations is what I was given; I have no way to change that. The hardware is also what is given to me.
I am trying to think, from a SAS coding perspective, of good ways to maximize the use of in-memory operations rather than frequent I/O operations.
02-27-2013 01:35 PM
This is one of those cases where I think neither using a BY statement nor keeping the 500 million rows around is a good idea.
Instead, calculate the summary statistics for each simulation as it is generated, resulting in a data set of 10,000 rows that you then analyze. This may require going back to the basic definitions of mean, std, and skewness.
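The idea of computing per-simulation statistics from raw moments, without ever writing the 500 million detail rows to disk, could be sketched like this. The loop sizes and the rannor call are placeholders for the real valuation logic:

```sas
/* Sketch: accumulate raw moments while generating, so the detail rows
   are never written out. Only 10,000 summary rows reach disk. */
data sim_stats (keep=sim n mean std skew);
    do sim = 1 to 10000;
        n = 0; s1 = 0; s2 = 0; s3 = 0;
        do i = 1 to 50000;
            value = rannor(0);           /* stand-in for the real valuation */
            n  + 1;
            s1 + value;
            s2 + value**2;
            s3 + value**3;
        end;
        mean = s1 / n;
        var  = s2 / n - mean**2;
        std  = sqrt(var);
        /* third central moment from raw moments, divided by std**3 */
        skew = (s3/n - 3*mean*(s2/n) + 2*mean**3) / std**3;
        output;
    end;
run;
```

This is the "back to basic definitions" approach: mean, variance, and skewness are all expressible in terms of the running sums of value, value**2, and value**3.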
02-27-2013 02:58 PM
It's still not clear to me what you want to do and on what volume of data. Doing simulations on 50,000 rows of data in temporary arrays is feasible, but not on 500 million. The array technique is good for resource-intensive processing of small amounts of data (e.g. simulations, modelling, optimisations), not for speeding up "normal" processing of very large data sets.
Here is an example of doing a mean on a temporary array: mean(of array_name(*)). Many SAS functions can be used this way.
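A tiny self-contained illustration of that syntax (the values are made up):

```sas
/* OF array-name(*) passes the whole array to a function. */
data _null_;
    array r(5) _temporary_ (3 1 4 1 5);
    m = mean(of r(*));                   /* m = 2.8 */
    s = std(of r(*));
    put m= s=;
run;
```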
02-28-2013 06:46 PM
The simulation result is 500 million rows, which then need to be aggregated via slice and dice, generating about 20 million rows of data that are output into a SAS temp file. Then we use summary statistics to get min/std/max on that 14 million rows of aggregated data.
Using sashelp.cars as an example, assume it contains 500 million rows. Then we aggregate it based on all the combinations of make, model, type, origin, drivetrain, enginesize, cylinders, and horsepower, and we could get 14 million rows of aggregated data.
We want to do the stats on this aggregated data. But these 14 million rows are output into a SAS temp file, which creates a huge bottleneck.
Essentially, it is the derivation of a middle step, and this middle step is a SAS file. Is there any way to do it entirely in memory?
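One way to avoid materializing the middle step is to define it as a view, so the aggregation is pipelined into the statistics step instead of landing on disk first. A sketch using the sashelp.cars example (msrp and the output names are illustrative assumptions):

```sas
/* The aggregation is defined as a view, not a table: nothing is
   written to disk until the consuming step asks for rows. */
proc sql;
    create view work.agg_v as
    select make, model, type, origin, drivetrain,
           enginesize, cylinders, horsepower,
           mean(msrp) as avg_msrp
    from sashelp.cars
    group by make, model, type, origin, drivetrain,
             enginesize, cylinders, horsepower;
quit;

/* The view is evaluated on the fly; no intermediate SAS file is created. */
proc means data=work.agg_v min max std;
    var avg_msrp;
run;
```

Whether this actually beats writing the temp file depends on memory and how many times the middle step is reused - a view is recomputed on every read, so it pays off most when the aggregate is consumed once.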