How to leverage the Hadoop Distributed File System using the SAS Scalable Performance Data Server

In this latest installment in the SAS Data Management for Hadoop article series, I’ll explain how to leverage Hadoop using the SAS Scalable Performance Data (SPD) Server. The SPD Server provides a data format that supports the creation of analytical base tables with hundreds of thousands of columns. These analytical base tables are used to support daily predictive analytical routines. Traditionally, Storage Area Network (SAN) storage has been (and continues to be) the primary storage platform for the SAS® Scalable Performance Data Server format. Because of the cost of SAN storage, companies have added Hadoop to their environments to help minimize storage costs.

In the 5.2 release for the SAS® Scalable Performance Data Server, support for the Hadoop Distributed File System (HDFS) was added. Here are the supported Hadoop distributions, with or without Kerberos:

Cloudera CDH2

The SPD Server organizes data into a file format that has advantages for a distributed file system like HDFS. Advantages of the SPD Server file format include the following:

Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.
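
As a rough illustration of that layout, the sketch below maps a component file name to its role by looking for one of the four extensions listed above. The function name and the sample file names are hypothetical, and the assumption that the extension appears as a dot-separated part after the table name is mine; real file names on disk may carry additional parts that are not modeled here.

```python
# Extension-to-role mapping taken from the article text.
COMPONENT_ROLES = {
    "dpf": "data",
    "mdf": "metadata",
    "hbx": "index",
    "idx": "index",
}


def component_role(file_name: str) -> str:
    """Return the role of a component file based on its extension.

    Assumption: the extension is one of the dot-separated parts that
    follow the table name. Anything else is reported as "unknown".
    """
    parts = file_name.lower().split(".")[1:]  # skip the table name
    for part in parts:
        if part in COMPONENT_ROLES:
            return COMPONENT_ROLES[part]
    return "unknown"


# Hypothetical file names, for illustration only.
for name in ["sales.mdf", "sales.dpf", "sales.hbx", "sales.idx", "notes.txt"]:
    print(f"{name}: {component_role(name)}")
```

Because the metadata sits in its own small file, a tool like this can tell the pieces of a table apart without opening any of the (potentially large) data files.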

SPD Server cluster tables allow one or more identically structured SPD Server tables to be referenced as one logical table. The tables in a cluster table are called members. You can swap, add, or delete members without holding a lock on the cluster table. This makes cluster tables flexible to maintain, and the primary benefit is that end users who need read access to them experience no downtime.

The SPD Server file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but it is referenced as one logical file.

The default partition size is 128 megabytes. You can change it by overriding the MINPARTSIZE parameter in the spdsserv.parm file.
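
To get a feel for how the partition size translates into physical files, here is a back-of-the-envelope sketch in Python. It simply divides the data size by the partition size and rounds up. The function name is hypothetical, and the calculation is an approximation under the assumption that partitions are filled to the configured size; the server's actual partition boundaries may differ.

```python
MEGABYTE = 1024 * 1024
DEFAULT_PARTITION_SIZE_MB = 128  # default stated in the article


def estimate_partition_files(data_bytes: int,
                             partition_size_mb: int = DEFAULT_PARTITION_SIZE_MB) -> int:
    """Estimate how many .dpf partition files a table's data occupies.

    Approximation only: assumes every partition except the last is
    filled to exactly the configured partition size.
    """
    if data_bytes <= 0 or partition_size_mb <= 0:
        raise ValueError("data_bytes and partition_size_mb must be positive")
    partition_bytes = partition_size_mb * MEGABYTE
    # Integer ceiling division: a partial last partition still needs a file.
    return (data_bytes + partition_bytes - 1) // partition_bytes


print(estimate_partition_files(100 * MEGABYTE))         # 1
print(estimate_partition_files(129 * MEGABYTE))         # 2
print(estimate_partition_files(10 * 1024 * MEGABYTE))   # 80
print(estimate_partition_files(10 * 1024 * MEGABYTE, partition_size_mb=256))  # 40
```

Under these assumptions, 10 GB of data at the default setting lands in about 80 partition files, and doubling the partition size halves that count.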

Like SAS data sets, the SPD Server table supports analytical base tables containing hundreds of thousands of columns. These analytical base tables become source tables to predictive analytical routines.

Follow the community for my next post where we explore how to create SPD Server tables on HDFS.

Miressa · ‎12-30-2015

please help me to download SAS software

AnnaBrown · ‎01-04-2016

Hi Miressa,

Thanks for your comment and visiting the community! Can you provide more detail? What version of SAS do you have? Or are you referring to SAS free software? If the latter, check out the SAS Analytics U Community that's packed with info to get started with SAS: free software, how-to guides, and connections to experts.

Anna
