data manipulation in memory, hash table is the only way?

Frequent Contributor
Posts: 133

To do data manipulation (adding, sorting, splitting, searching) in memory in SAS, is a hash table the only way?

Super User
Posts: 5,511

Re: data manipulation in memory, hash table is the only way?

In a DATA step, all data manipulation takes place in memory.  What is it that you are trying to find out?

Frequent Contributor
Posts: 133

Re: data manipulation in memory, hash table is the only way?

Posted in reply to Astounding

data tmp;

x=1;

run;

tmp is an actual file written to the WORK directory; that is not data manipulation in memory.

Super User
Posts: 5,430

Re: data manipulation in memory, hash table is the only way?

You might want to examine the concepts documentation carefully and use the correct terminology.

In my world, no data manipulation (regardless of software) takes place on disk alone: you need the CPU to do the manipulation on pieces of data that reside in memory (aka RAM).

In your example, the data manipulation part is x=1;

After that, the result is saved to the file (or table) tmp, which is an I/O operation, not data manipulation.

So, like https://communities.sas.com/people/Astounding, I wonder what you are really looking for.

Hash techniques let you manipulate, look up, and sort data while keeping it explicitly in RAM before storing it to disk.
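As a rough illustration of the hash approach, here is a minimal sketch of an in-memory lookup. The table names (work.sec_ref, work.sim) and variables (sec_id, sector, price_chg) are hypothetical and only serve the example.

```sas
/* Hypothetical lookup table: one row per security */
data work.sec_ref;
   input sec_id sector $;
   datalines;
1 Tech
2 Energy
3 Retail
;
run;

/* Hypothetical detail rows to be enriched */
data work.sim;
   input sec_id price_chg;
   datalines;
2 0.015
1 -0.020
3 0.004
9 0.010
;
run;

data work.sim_enriched;
   /* Define the hash host variables without reading any rows */
   if 0 then set work.sec_ref;
   if _n_ = 1 then do;
      /* Load the lookup table into memory once */
      declare hash ref(dataset: 'work.sec_ref');
      ref.defineKey('sec_id');
      ref.defineData('sector');
      ref.defineDone();
   end;
   set work.sim;
   /* find() returns 0 when the key exists and fills SECTOR; */
   /* otherwise clear the value retained from the previous row */
   if ref.find() ne 0 then call missing(sector);
run;
```

The lookup table is read from disk once; every later lookup is served from memory, and no sort of either table is needed.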

Regular DATA step processing mainly gives you just the current row in memory (logically).

For lookups, you can use formats (which are loaded into memory during processing).
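A minimal sketch of a format-based lookup follows. The format name secfmt and the sample values are made up for illustration.

```sas
/* Hypothetical format mapping a security id to a sector label */
proc format;
   value secfmt
      1     = 'Tech'
      2     = 'Energy'
      3     = 'Retail'
      other = 'Unknown';
run;

data work.sim_fmt;
   input sec_id price_chg;
   length sector $8;
   /* PUT applies the format, which SAS holds in memory for the step */
   sector = put(sec_id, secfmt.);
   datalines;
2 0.015
1 -0.020
9 0.010
;
run;
```

For a large lookup table, the format would typically be built from a data set with the CNTLIN= option of PROC FORMAT rather than typed in by hand.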

For sorting, aggregation, and other statistical calculations, SQL and SAS PROCs will do a lot of the work with data loaded into memory.

Data never sleeps
Super User
Posts: 5,430

Re: data manipulation in memory, hash table is the only way?

Another option that came to mind is the SASFILE statement, which lets you explicitly read a table into memory for use by multiple other steps during a SAS session.
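A minimal sketch of how SASFILE might be used, assuming a hypothetical table work.sim that is small enough to fit in available memory:

```sas
/* Hypothetical table that several later steps will read */
data work.sim;
   do sec_id = 1 to 100;
      do sim_no = 1 to 50;
         price_chg = rannor(12345) * 0.02;
         output;
      end;
   end;
run;

/* Read the whole table into memory and hold it there */
sasfile work.sim load;

/* Both steps below read work.sim from memory, not from disk */
proc means data=work.sim mean std skewness;
   var price_chg;
run;

proc freq data=work.sim;
   tables sec_id / noprint out=work.counts;
run;

/* Release the memory when finished */
sasfile work.sim close;
```

The benefit grows with the number of steps that reread the same table; a table that is read only once gains little.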

Data never sleeps
Frequent Contributor
Posts: 133

Re: data manipulation in memory, hash table is the only way?

I usually have a data set with over 50,000 securities, each run through 10,000 simulations to value the price changes. Then I need to compute quick stats like mean, std, skewness, etc. on the generated data set.

Given well over 500 million rows, doing the calculation in memory appears to be the only option for fast processing.

Can you help think of any in-memory techniques with minimal I/O operations?

Trusted Advisor
Posts: 2,116

Re: data manipulation in memory, hash table is the only way?

JMP does everything in memory, though we don't have any machines with that much memory...

You might consider a hardware solution: getting a solid state drive.  It's not as fast as RAM, but it is a lot faster than a spindle disk.

BTW, you don't need 10000 simulations to get good estimates of those parameters; a few hundred will suffice.

Doc Muhlbaier

Duke

Super User
Posts: 3,255

Re: data manipulation in memory, hash table is the only way?

One way to do in-memory DATA step processing is to load your input data into a temporary array (ARRAY X(2,2) _temporary_;). For example, an input table of 2 columns and 2 rows becomes a 2 by 2 temporary array. You can then process the array using DO loops.

Once processing is finished you then "unpack" the temporary array back into columns and rows to write it out.
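The load, process, and unpack steps described above can be sketched as follows. The table work.in, its columns a and b, and the hard-coded 2 by 2 dimensions are assumptions for illustration only.

```sas
/* Hypothetical 2-row, 2-column input table */
data work.in;
   input a b;
   datalines;
1 2
3 4
;
run;

data work.out(keep=a b);
   array x(2,2) _temporary_;

   /* Load: copy every row and column of the input into the array */
   do r = 1 to 2;
      set work.in;
      x(r,1) = a;
      x(r,2) = b;
   end;

   /* Process entirely in memory: here, double every cell */
   do r = 1 to 2;
      do c = 1 to 2;
         x(r,c) = x(r,c) * 2;
      end;
   end;

   /* Unpack: write the array back out as rows and columns */
   do r = 1 to 2;
      a = x(r,1);
      b = x(r,2);
      output;
   end;
   stop;
run;
```

In real use the array dimensions would have to match the input table, for example by passing the row count in through a macro variable.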

I have used this technique for resource-intensive optimisation processes and it is pretty fast, but you are limited by memory and maximum array size.

There are some interesting developments in SAS 9.3 around PROC FCMP, with more to come in 9.4, that may help in this area. Search SAS Support for more details.

Frequent Contributor
Posts: 133

Re: data manipulation in memory, hash table is the only way?

The 10k simulations are what I was given, and I have no way to change that. The hardware is also what I was given.

I am trying to think from a SAS coding perspective about good ways to maximize the use of in-memory operations rather than frequent I/O operations.

Super User
Posts: 19,817

Re: data manipulation in memory, hash table is the only way?

This is one of those cases where I think it is a great idea not to use a BY statement and not to keep the 500 million rows around.

Calculate summary statistics for each simulation, resulting in a data set of 10,000 rows that you then analyze. This may require going back to the basic definitions of mean, std, and skewness.
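One way to read "going back to basic definitions" is to accumulate power sums while the values are generated, so the detail rows are never written out. This is only a sketch: the RAND call is a stand-in for the real simulation, the variable names are hypothetical, and the skewness formula is the bias-adjusted sample version.

```sas
data work.moments(keep=n mean std skew);
   call streaminit(2024);

   do sim_no = 1 to 10000;
      /* Stand-in for one simulated price change (assumed non-missing) */
      x = rand('normal', 0, 0.02);

      /* Running count and power sums; no detail row is output */
      n  + 1;
      s1 + x;
      s2 + x**2;
      s3 + x**3;
   end;

   mean     = s1 / n;
   /* Sample variance from the power sums */
   variance = (s2 - n * mean**2) / (n - 1);
   std      = sqrt(variance);
   /* Sum of cubed deviations, expanded in terms of the power sums */
   m3       = s3 - 3 * mean * s2 + 2 * n * mean**3;
   /* Bias-adjusted sample skewness */
   skew     = (n / ((n - 1) * (n - 2))) * m3 / std**3;
   output;
run;
```

Power sums can lose precision when the mean is large relative to the spread, so results should be checked against PROC MEANS on a small sample before relying on them.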

Super User
Posts: 3,255

Re: data manipulation in memory, hash table is the only way?

It's still not clear to me what you want to do and on what volume of data. Doing simulations on 50,000 rows of data in temporary arrays is feasible, but not 500 million. The array technique is good for resource-intensive processing of small amounts of data (e.g. simulations, modelling, optimisations), not for speeding up "normal" processing of very large datasets.

Here is an example of doing a mean on a temporary array: mean(of array_name(*)). Many SAS functions can be used this way.
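A small sketch of that pattern, with a hypothetical array sims and made-up values:

```sas
data _null_;
   /* Hypothetical simulated values held only in memory */
   array sims(5) _temporary_ (0.01, -0.02, 0.03, 0.00, 0.015);

   /* The OF array(*) syntax passes every element to the function */
   avg = mean(of sims(*));
   sd  = std(of sims(*));
   sk  = skewness(of sims(*));

   /* Write the results to the log */
   put avg= sd= sk=;
run;
```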

Frequent Contributor
Posts: 133

Re: data manipulation in memory, hash table is the only way?

The simulation result is 500 million rows. It then needs to be aggregated by slicing and dicing the 500 million rows, which generates about 20 million rows of data, which are then output to a SAS temp file. Then we use summary stats to get min/std/max on those 14 million rows of aggregated data.

Using sashelp.cars as an example, assume it contains 500 million rows. If we then aggregate it by all combinations of make, model, type, origin, drivetrain, enginesize, cylinders, and horsepower, we could get 14 million rows of aggregated data.

We want to compute stats on this aggregated data. But these 14 million rows are output to a SAS temp file, which creates a huge bottleneck.

Essentially, the middle step is derived as a SAS file. Is there any way to do this entirely in memory?
