05-26-2015 03:58 AM
I can see the use in a non-distributed environment - easy to subset smaller part of a table.
But, for distributed environments, isn't the whole idea of that setup that you spread processing across the cluster, and not isolate it to a single node? Could you please elaborate a bit around this?
05-26-2015 04:46 AM
We usually partition distributed tables in LASR not because we want to limit processing to a single (ore some) node, but because we want to grantee, that "all of the rows that match a partition key are on a single machine".
For example if you do lot of by group processing on a specific variable (e.g. customerID) it makes sense to keep the data partitioned in the (distributed)memory. This will reduce the needed communication between nodes (no "shuffle" of data is needed).
If you additionally have a variable that you use to order observations within those groups (e.g. time), it also makes sense to "pre-order" data.
All this also makes possible traditional "data-step-like by group processing", but distributed! See an example: Using Partitioning and Scoring in:
And there might be one more important aspect: if some of the partitions are rarely used (because you apply a filter on the table while doing analysis), the unneeded memory blocks can be freed by the operating system.
05-26-2015 04:48 AM
Indexes are there to speed up I/O. Since in-memory processing does not do any I/O per design (after the data is loaded), this question is meaningless.