
How to leverage the Hadoop Distributed File System using the SAS Scalable Performance Data Server

by SAS Employee SteveSober on 11-20-2015 10:21 AM - edited on 01-19-2016 04:48 PM by Community Manager

In the latest installment in the SAS Data Management for Hadoop article series, I’ll explain how to leverage Hadoop using the SAS Scalable Performance Data (SPD) Server. The SPD Server provides a data format that supports the creation of analytical base tables with hundreds of thousands of columns. These analytical base tables support daily predictive analytical routines. Traditionally, Storage Area Network (SAN) storage has been (and continues to be) the primary storage platform for the SAS® Scalable Performance Data Server format. Due to the cost constraints associated with SAN storage, companies have added Hadoop to their environments to help minimize storage costs.

 

In the 5.2 release for the SAS® Scalable Performance Data Server, support for the Hadoop Distributed File System (HDFS) was added. Here are the supported Hadoop distributions, with or without Kerberos:

 

  • Cloudera CDH2

The SPD Server organizes data into a file format that has advantages for a distributed file system like HDFS. Advantages of the SPD Server file format include the following:

  • Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.

 [Image: SPDE.Blog2.image1.png]

  • SPD Server cluster tables allow one or more identically structured SPD Server tables to be referenced as one logical table. The tables in a cluster table are called members. You can swap, add, or delete members without holding a lock on the cluster table. This provides flexibility in the maintenance of SPD Server cluster tables, with the primary benefit being no downtime for end users who need read access to the cluster tables.
  • The SPD Server file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.
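As a sketch of how members are combined into a cluster table, the following PROC SPDO step binds two existing, identically structured member tables into one logical table. The libref mylib and the table names sales2014, sales2015, and sales_hist are hypothetical; consult the SPD Server documentation for the full CLUSTER syntax.

```
/* Hypothetical example: combine two identically structured   */
/* SPD Server tables into one logical cluster table.          */
proc spdo library=mylib;
   cluster create sales_hist    /* name of the cluster table  */
      mem=sales2014             /* first member table         */
      mem=sales2015;            /* second member table        */
quit;
```

Because members can later be added or removed with additional CLUSTER statements, readers of sales_hist are never blocked while the underlying members change.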

The default partition size is 128 megabytes. You can alter the default by setting the MINPARTSIZE parameter in the spdserver.parm file.
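As a sketch, raising the partition size might look like the following excerpt from the server parameter file. The 256-megabyte value is hypothetical; choose a size appropriate for your data volumes and HDFS block size.

```
/* spdserver.parm -- excerpt (hypothetical value)            */
/* MINPARTSIZE sets the minimum data partition size.         */
/* Raising it from the 128 MB default to 256 MB produces     */
/* fewer, larger .dpf files per table.                       */
MINPARTSIZE=256M;
```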

 

Like SAS data sets, the SPD Server table supports analytical base tables containing hundreds of thousands of columns. These analytical base tables become source tables to predictive analytical routines.

 

[Image: SPDE.Blog2.image2.png]

 

Follow the community for my next post where we explore how to create SPD Server tables on HDFS.

Here are links to other posts in the SAS Data Management for Hadoop series for reference:

 

Comments
by New User Miressa
on ‎12-30-2015 11:05 AM

please help me to download SAS software

by Community Manager
on ‎01-04-2016 12:10 PM

Hi Miressa,

 

Thanks for your comment and visiting the community! Can you provide more detail? What version of SAS do you have? Or are you referring to SAS free software? If the latter, check out the SAS Analytics U Community that's packed with info to get started with SAS: free software, how-to guides, and connections to experts. 

 

Anna
