ZRick
Obsidian | Level 7

To do data manipulation (adding, sorting, splitting, searching) in memory in SAS, is a hash table the only way?

Astounding
PROC Star

In a DATA step, all data manipulation takes place in memory.  What is it that you are trying to find out?

ZRick
Obsidian | Level 7

data tmp;
   x=1;
run;

tmp is an actual file written to the WORK directory; that is not data manipulation in memory.

LinusH
Tourmaline | Level 20

You might want to examine the concepts documentation carefully and use the correct terminology.

In my world, no data manipulation (regardless of software) takes place on disk alone - you need the CPU to do the manipulation on pieces of data that reside in memory (aka RAM).

In your example, the data manipulation part is x=1;

This is followed by the result being saved to the file (or table) tmp - which is an I/O operation, not data manipulation.

So, like https://communities.sas.com/people/Astounding, I wonder what you are really looking for?

Hash techniques let you manipulate, look up, and sort data while keeping it explicitly in RAM before storing it to disk.
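For example, a minimal hash-lookup sketch (the tables WORK.RATES, keyed by SECID with data variable RATE, and WORK.TRADES are hypothetical):

data priced;
   if _n_ = 1 then do;
      if 0 then set rates;          /* make SECID and RATE known to the compiler */
      declare hash h(dataset:'rates');
      h.definekey('secid');
      h.definedata('rate');
      h.definedone();
   end;
   set trades;
   if h.find() = 0 then output;     /* keep only trades with a matching rate */
run;

The whole RATES table sits in RAM for the duration of the step; only the matched rows are written back to disk.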

Regular data step processing will mainly let you have the current row in memory (logically).

For lookups, you can use formats (which are loaded into memory during processing).
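A minimal sketch of a format-based lookup (the security-to-sector mapping is made up for illustration):

proc format;
   value $sector
      'AAPL' = 'Tech'
      'XOM'  = 'Energy'
      other  = 'Unknown';
run;

data tagged;
   set trades;                      /* hypothetical input with variable SECID */
   sector = put(secid, $sector.);   /* in-memory lookup, no join needed */
run;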

For sorting, aggregation, and other statistical calculations, SQL and SAS PROCs will do a lot of the work with data loaded into memory.

Data never sleeps
LinusH
Tourmaline | Level 20

Another option that comes to mind is the SASFILE statement, which lets you explicitly read a table into memory for use by multiple other steps during a SAS session.
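A minimal sketch, assuming a hypothetical table WORK.BIG that several later steps need to read:

sasfile work.big load;              /* pin the table in memory */

proc means data=work.big;           /* this step reads from RAM, not disk */
   var price;
run;

proc sort data=work.big out=work.big_sorted;
   by secid;
run;

sasfile work.big close;             /* release the memory */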

Data never sleeps
ZRick
Obsidian | Level 7

I usually have a data set with over 50,000 securities and run 10,000 simulations to value the price changes. Then I need to do quick stats like mean, std, skewness, etc. based on the generated data set.

Given the well over 500 million rows, doing the calculation in memory appears to be the only option for fast processing.

Can you think of any in-memory techniques that would keep I/O operations to a minimum?

Doc_Duke
Rhodochrosite | Level 12

JMP does everything in memory, though we don't have any machines with that much memory...

You might consider a hardware solution: getting a solid-state drive. It's not as fast as RAM, but it is a lot faster than a spinning disk.

BTW, you don't need 10000 simulations to get good estimates of those parameters; a few hundred will suffice.

Doc Muhlbaier

Duke

SASKiwi
PROC Star

One way to do in-memory DATA step processing is to load your input data into a temporary array (ARRAY X(2,2) _TEMPORARY_;) - for example, an input table of 2 columns and 2 rows becomes a 2 by 2 temporary array. You can then process the array using DO loops.

Once processing is finished you then "unpack" the temporary array back into columns and rows to write it out.
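A minimal sketch of that load/process/unpack pattern, assuming a hypothetical 2-row, 2-column input table WORK.PRICES with variables COL1 and COL2 (the doubling is just a placeholder calculation):

data result(keep=col1 col2);
   array x(2,2) _temporary_;

   /* load: read the input rows into the array */
   do i = 1 to 2;
      set prices point=i;           /* direct-access read */
      x(i,1) = col1;
      x(i,2) = col2;
   end;

   /* process: work on the array entirely in memory */
   do i = 1 to 2;
      do j = 1 to 2;
         x(i,j) = x(i,j) * 2;       /* placeholder calculation */
      end;
   end;

   /* unpack: write the array back out as rows */
   do i = 1 to 2;
      col1 = x(i,1);
      col2 = x(i,2);
      output;
   end;
   stop;                            /* required when reading with POINT= */
run;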

I have used this technique for resource-intensive optimisation processes and it is pretty fast, but you are limited by memory and maximum array size.

There are some interesting developments in SAS 9.3, such as PROC FCMP, with more to come in 9.4, that may help in this area. Search SAS Support for more details.

ZRick
Obsidian | Level 7

10K simulations is what I was given; I have no way to change that. The hardware is also what is given to me.

I am trying to think, from a SAS coding and programming perspective, of some good ways to maximize the use of in-memory operations rather than frequent I/O operations.

Reeza
Super User

This is one of those cases where I think neither using a BY nor keeping the 500 million rows around is a great idea.

Calculate summary statistics for each simulation as you go, resulting in a data set of 10,000 rows that you then need to analyze. This may require going back to the basic definitions of mean, std, and skewness.
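A minimal sketch of that idea, using the counts from this thread and RAND as a placeholder for the real valuation - the 500 million detail rows are never written at all:

data sim_stats(keep=sim_id n mean std skew);
   do sim_id = 1 to 10000;                  /* 10,000 simulations */
      n = 0; s1 = 0; s2 = 0; s3 = 0;
      do i = 1 to 50000;                    /* 50,000 securities  */
         price = rand('normal');            /* placeholder valuation */
         n  = n + 1;
         s1 = s1 + price;
         s2 = s2 + price**2;
         s3 = s3 + price**3;
      end;
      mean = s1 / n;
      std  = sqrt((s2 - n*mean**2) / (n - 1));
      /* sample skewness from raw moments, as PROC MEANS reports it */
      skew = (n / ((n-1)*(n-2))) * (s3 - 3*mean*s2 + 2*n*mean**3) / std**3;
      output;
   end;
run;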

SASKiwi
PROC Star

It's still not clear to me what you want to do and on what volume of data. Doing simulations on 50,000 rows of data in temporary arrays is feasible, but not on 500 million. The array technique is good for resource-intensive processing of small amounts of data (e.g. simulations, modelling, optimisations), not for speeding up "normal" processing of very large datasets.

Here is an example of doing a mean on a temporary array: mean(of array_name(*)). Many SAS functions can be used this way.
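For instance, a trivial self-contained example:

data _null_;
   array x(5) _temporary_ (2 4 6 8 10);
   avg = mean(of x(*));
   put avg=;                        /* prints avg=6 */
run;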

ZRick
Obsidian | Level 7

The simulation result is 500 million rows. It then needs to be aggregated by slicing and dicing those 500 million rows, which generates about 20 million rows of data that are output to a SAS temp file. Then we use summary stats to get min/std/max on that 14 million rows of aggregated data.

Using sashelp.cars as an example: assume it contains 500 million rows. If we aggregate it based on all the combinations of Make, Model, Type, Origin, DriveTrain, EngineSize, Cylinders, and Horsepower, we could get 14 million rows of aggregated data.

We want to do the stats on this aggregated data. But these 14 million rows are output to a SAS temp file, which creates a huge bottleneck.

Essentially, it is the derivation of an intermediate step, and this intermediate step is a SAS file. Is there any way to do it entirely in memory?
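For illustration, something like this is the kind of thing I am imagining - a hash object keyed on a subset of the sashelp.cars class variables, with MSRP standing in for the simulated value, so only the final aggregate is ever written to disk:

data _null_;
   length n total 8;                        /* host variables for the hash */
   declare hash agg(ordered:'yes');
   agg.definekey('make','type','origin','drivetrain');
   agg.definedata('make','type','origin','drivetrain','n','total');
   agg.definedone();

   do until (done);
      set sashelp.cars end=done;
      if agg.find() ne 0 then do;           /* first time this key is seen */
         n = 0;
         total = 0;
      end;
      n = n + 1;
      total = total + msrp;
      agg.replace();                        /* update the in-memory cell */
   end;

   agg.output(dataset:'work.car_agg');      /* single write at the end */
   stop;
run;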
