
Efficiently Managing and Migrating Models Between SAS and R with PMML

Started ‎09-18-2020 by
Modified ‎04-22-2020 by

The Predictive Model Markup Language (PMML) is an XML-based standard for describing predictive models, developed and managed by the Data Mining Group (DMG). A model expressed in the PMML XML schema can be used from many different languages and tools. SAS, SPSS, R, Apache Spark, and Teradata are a few that support PMML; for a full list, check out the Data Mining Group's website.

 

Let's jump straight into an example of the PMML XML for a regression model built on the commonly available iris data. A similar example can be found on the DMG website:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/pmml/v4-2/pmml-4-2.xsd">
 <Header copyright="Copyright (c) 2020 agannon" description="Linear Regression Model">
  <Extension name="user" value="agannon" extender="SoftwareAG PMML Generator"/>
  <Application name="SoftwareAG PMML Generator" version="2.2.0"/>
  <Timestamp>2020-04-20 16:10:43</Timestamp>
 </Header>
 <DataDictionary numberOfFields="5">
  <DataField name="sepallength" optype="continuous" dataType="double"/>
  <DataField name="sepalwidth" optype="continuous" dataType="double"/>
  <DataField name="petallength" optype="continuous" dataType="double"/>
  <DataField name="petalwidth" optype="continuous" dataType="double"/>
  <DataField name="species" optype="categorical" dataType="string">
   <Value value="setosa"/>
   <Value value="versicolor"/>
   <Value value="virginica"/>
  </DataField>
 </DataDictionary>
 <RegressionModel modelName="lm_Model" functionName="regression" algorithmName="least squares">
  <MiningSchema>
   <MiningField name="sepallength" usageType="predicted" invalidValueTreatment="returnInvalid"/>
   <MiningField name="sepalwidth" usageType="active" invalidValueTreatment="returnInvalid"/>
   <MiningField name="petallength" usageType="active" invalidValueTreatment="returnInvalid"/>
   <MiningField name="petalwidth" usageType="active" invalidValueTreatment="returnInvalid"/>
   <MiningField name="species" usageType="active" invalidValueTreatment="returnInvalid"/>
  </MiningSchema>
  <Output>
   <OutputField name="Predicted_sepallength" optype="continuous" dataType="double" feature="predictedValue"/>
  </Output>
  <RegressionTable intercept="2.17126629215507">
   <NumericPredictor name="sepalwidth" exponent="1" coefficient="0.495888938388551"/>
   <NumericPredictor name="petallength" exponent="1" coefficient="0.829243912234806"/>
   <NumericPredictor name="petalwidth" exponent="1" coefficient="-0.315155173326474"/>
   <CategoricalPredictor name="species" value="setosa" coefficient="0"/>
   <CategoricalPredictor name="species" value="versicolor" coefficient="-0.723561957780729"/>
   <CategoricalPredictor name="species" value="virginica" coefficient="-1.02349781449083"/>
  </RegressionTable>
 </RegressionModel>
</PMML>

 

The main parts of the PMML document are the declaration and validation of the required variables (DataDictionary), the model-specific variable roles (MiningSchema), the output field definition (Output), and the model coefficients (RegressionTable). The model defined by this PMML can be written in the standard form:

PMML_Equation.png
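
In case the image does not display, here is a reconstruction of that equation from the RegressionTable above. The coefficients are rounded to four decimals, and the bracket notation (1 when the condition holds, 0 otherwise) is a notational choice made here, not something taken from the PMML file. Setosa carries a coefficient of 0, so it drops out:

```latex
\begin{aligned}
\widehat{\text{sepallength}} ={}& 2.1713
  + 0.4959\,\text{sepalwidth}
  + 0.8292\,\text{petallength}
  - 0.3152\,\text{petalwidth} \\
 & - 0.7236\,[\text{species} = \text{versicolor}]
  - 1.0235\,[\text{species} = \text{virginica}]
\end{aligned}
```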

 

The PMML version (in XML) is noticeably more verbose, but the benefit is the portability that comes with it.
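
To make the portability point concrete, here is a minimal sketch of reading the coefficients back out of the XML and scoring one observation by hand in R. It assumes the XML package is installed, and it parses only the RegressionTable fragment copied from the listing above rather than a file on disk. The input values are the first row of the iris data.

```r
library(XML)  # assumption: the XML package is installed

# RegressionTable copied from the PMML listing above
pmml_fragment <- '<RegressionTable intercept="2.17126629215507">
  <NumericPredictor name="sepalwidth" exponent="1" coefficient="0.495888938388551"/>
  <NumericPredictor name="petallength" exponent="1" coefficient="0.829243912234806"/>
  <NumericPredictor name="petalwidth" exponent="1" coefficient="-0.315155173326474"/>
  <CategoricalPredictor name="species" value="setosa" coefficient="0"/>
  <CategoricalPredictor name="species" value="versicolor" coefficient="-0.723561957780729"/>
  <CategoricalPredictor name="species" value="virginica" coefficient="-1.02349781449083"/>
</RegressionTable>'

doc <- xmlParse(pmml_fragment, asText = TRUE)
intercept <- as.numeric(xmlGetAttr(xmlRoot(doc), "intercept"))

# local-name() keeps the XPath working whether or not a PMML namespace is declared
numeric_nodes <- getNodeSet(doc, "//*[local-name()='NumericPredictor']")
numeric_coef <- sapply(numeric_nodes, function(n) as.numeric(xmlGetAttr(n, "coefficient")))
names(numeric_coef) <- sapply(numeric_nodes, xmlGetAttr, "name")

species_nodes <- getNodeSet(doc, "//*[local-name()='CategoricalPredictor']")
species_coef <- sapply(species_nodes, function(n) as.numeric(xmlGetAttr(n, "coefficient")))
names(species_coef) <- sapply(species_nodes, xmlGetAttr, "value")

# Score the first iris row: intercept + numeric terms + the species term
obs <- c(sepalwidth = 3.5, petallength = 1.4, petalwidth = 0.2)
prediction <- intercept + sum(numeric_coef * obs[names(numeric_coef)]) + species_coef[["setosa"]]
print(prediction)
```

Nothing here is specific to R; the same three steps (read the intercept, read the predictors, add up the terms) are what any PMML consumer has to do for a regression model.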

 

Let's look at an example that creates the model (from the above PMML file) in R and then imports and executes it in SAS. The R code below fits a model on the same data and writes the PMML version to an XML file:

# required packages (iris ships with base R in the datasets package, so only pmml needs installing)
install.packages("pmml")
library(pmml)

# grab iris data and rename the columns to match the field names in the PMML above
in_iris <- iris
colnames(in_iris) <- c("sepallength", "sepalwidth", "petallength", "petalwidth", "species")

# build regression model for sepallength, using all other columns as predictors
iris_lm <- lm(sepallength ~ ., data = in_iris)

# create pmml and save as xml file
iris_lm_pmml <- pmml(iris_lm)
save_pmml(iris_lm_pmml, "/home/user/iris.xml")

 

This R code fits a linear regression model that estimates sepallength from the other iris variables, converts the model to PMML, and saves it as an XML file. Apart from header details such as the user and timestamp, this file should match the PMML shown earlier.
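
One way to check that claim without opening the XML is to compare the fitted coefficients in R against the values in the RegressionTable. The sketch below is self-contained and uses only base R; the coefficient values are copied from the PMML listing earlier, and the column names are assumed to follow the PMML field names.

```r
# Coefficients copied from the RegressionTable in the PMML listing
pmml_coef <- c(
  "(Intercept)"       = 2.17126629215507,
  "sepalwidth"        = 0.495888938388551,
  "petallength"       = 0.829243912234806,
  "petalwidth"        = -0.315155173326474,
  "speciesversicolor" = -0.723561957780729,
  "speciesvirginica"  = -1.02349781449083
)

# Refit the model on the built-in iris data, using the PMML field names
in_iris <- iris
colnames(in_iris) <- c("sepallength", "sepalwidth", "petallength", "petalwidth", "species")
iris_lm <- lm(sepallength ~ ., data = in_iris)

# setosa is the reference level, so R reports no coefficient for it;
# the PMML lists it explicitly with a coefficient of 0
r_coef <- coef(iris_lm)
coef_match <- isTRUE(all.equal(r_coef[names(pmml_coef)], pmml_coef, tolerance = 1e-8))
print(coef_match)
```

If the comparison fails, the usual suspects are different column names or a different reference level for species.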

 

Now we get to see the real magic of PMML by applying it in SAS. The PMML can be read into SAS through the XML LIBNAME engine or with the PSCORE procedure. PROC PSCORE is built specifically for reading a PMML file and generating a SAS file that contains the DATA step statements of the translated model. This new file can then be included inside a SAS DATA step to execute the model. Below is the code to read in and execute the model in SAS:

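/* translate the PMML file into DATA step scoring code */
proc pscore
   pmml file = "/home/user/iris.xml"
   ds file = "/home/user/iris.sas";
run;

/* score work.iris; its column names must match the PMML field names */
data work.iris_model;
   set work.iris;
   %include "/home/user/iris.sas";
run;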

This SAS code generates a SAS file and includes it in the DATA step to produce results. The dataset work.iris_model will contain all the data and variables from the iris dataset, as well as a new column, Predicted_sepallength. If you are curious to see what PROC PSCORE created, you can open the file at the location specified in the procedure. Below is what the SAS file looks like for this example:

/********************************************/;
* PSCORE TIMESTAMP: 2020-11-3 8:9:1.14 ;
* SAS ENCODING: latin1 ;
* PMML Path: /home/user/iris.xml ;
* PMML SOURCE: SoftwareAG PMML Generator;
* PMML SOURCE VERSION: 2.2.0;
* PMML TIMESTAMP: 2020-03-11 07:56:19 ;
* MODEL TYPE: RegressionModel ;
/********************************************/;
PSCR_WARN = 0;
if missing("sepalwidth"n) then do; PSCR_WARN = 1; end;
if missing("petallength"n) then do; PSCR_WARN = 1; end;
if missing("petalwidth"n) then do; PSCR_WARN = 1; end;
if missing("species"n) then do; PSCR_WARN = 1; end;
else do;
    "PSCR_AP0"n = "species"n;
    if ( "PSCR_AP0"n not in ( 'setosa', 'versicolor', 'virginica'  )  ) then do; PSCR_WARN = 1; end;
end;
if (PSCR_WARN) then do; goto PSCR_EXIT; end ;
"Predicted_sepallength"n = 0 ;
"Predicted_sepallength"n = "Predicted_sepallength"n + ( 0.49588893838855 ) * "sepalwidth"n ;
"Predicted_sepallength"n = "Predicted_sepallength"n + ( 0.8292439122348 ) * "petallength"n ;
"Predicted_sepallength"n = "Predicted_sepallength"n + ( -0.31515517332647 ) * "petalwidth"n ;
if ("PSCR_AP0"n = 'versicolor') then "Predicted_sepallength"n = "Predicted_sepallength"n + ( -0.72356195778072 );
else if ("PSCR_AP0"n = 'virginica') then "Predicted_sepallength"n = "Predicted_sepallength"n + ( -1.02349781449083 );
"Predicted_sepallength"n = "Predicted_sepallength"n + 2.17126629215507 ;
PSCR_EXIT :
drop
 "PSCR_AP0"n PSCR_WARN;

 

The header of the file is created from the header of the PMML file. Next, the SAS code checks that none of the variables used in the model are missing and that species holds one of the expected values; only then does it execute the model. When it's all said and done, we end up with the data and model results that we expect. The difference is that translating the model from R to SAS this way was far easier and less error-prone than the manual recoding that is often used instead. Below is the output data from the SAS execution:
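
For readers more at home in R, the logic of the generated DATA step can be sketched as a small function. This is a hypothetical transliteration for illustration only: score_sepallength is a name made up here, not something PROC PSCORE or the pmml package provides. The coefficients are copied from the generated SAS file above.

```r
# Hypothetical R transliteration of the generated DATA step logic
score_sepallength <- function(sepalwidth, petallength, petalwidth, species) {
  species <- as.character(species)
  valid_species <- c("setosa", "versicolor", "virginica")

  # Mirrors PSCR_WARN: leave the prediction missing when an input is
  # missing or the category is not one the model knows about
  if (anyNA(c(sepalwidth, petallength, petalwidth)) ||
      is.na(species) || !(species %in% valid_species)) {
    return(NA_real_)
  }

  # Accumulate the numeric terms, as the generated code does line by line
  pred <- 0
  pred <- pred + 0.49588893838855 * sepalwidth
  pred <- pred + 0.8292439122348 * petallength
  pred <- pred + -0.31515517332647 * petalwidth

  # Species offset; setosa is the reference level and adds nothing
  if (species == "versicolor") {
    pred <- pred + -0.72356195778072
  } else if (species == "virginica") {
    pred <- pred + -1.02349781449083
  }

  # The intercept is added last
  pred + 2.17126629215507
}

score_sepallength(3.5, 1.4, 0.2, "setosa")   # first row of iris
score_sepallength(3.5, 1.4, 0.2, "unknown")  # NA: category not in the model
```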

iris_predicted.png

 

This model portability between R and SAS is valuable for organizations that use a portfolio of programming languages to accomplish their goals. By bridging languages and platforms, including Apache Spark, R, and SAS, PMML is a great tool for model translation and management.

 

Note: if we were translating a model from SAS to R, we would first need to generate a PMML XML file from SAS. This could be done by writing out to a fileref or by using the XML LIBNAME engine (with a map) to generate the file. To read the file into R, additional packages would be needed; arules, for example, can import PMML for association rules.

 

Below is a list of resources for those interested in learning more about PMML:

Data Mining Group - PMML Documentation 

R - PMML Package Documentation

SAS - PMML Documentation

List of PMML Supported Products (DMG)
