Deploying anomaly detection score code in a SAS Data Mining and Machine Learning Model Studio project

Question

How can I deploy score code for anomaly detection from a Data Mining and Machine Learning project in Model Studio?

Answer

Currently, in a Data Mining and Machine Learning project in Model Studio, you can deploy score code only for a predictive model, that is, for a pipeline branch that includes a Supervised Learning node. But you might want the score code from the Anomaly Detection node, which uses an unsupervised learning method (one that does not involve the target variable). The SVDD procedure that the node runs uses the support vector data description algorithm to detect anomalies or outliers in your data, based on the input variables only. Its score code is saved in an analytic store. To deploy this analytic store for scoring new data, you can emulate having a model in that branch of your pipeline: add a SAS Code node after the Anomaly Detection node and move it to Supervised Learning. The branch should then end with the SAS Code node, placed as a Supervised Learning node directly after the Anomaly Detection node.

Now, in the SAS Code node, you can include code to simulate a column of target predictions. Note that if you don't have a true target in your data, you can either create a pseudo target or use another variable in your data set that is not used as an input for the Anomaly Detection node. If your target is interval, enter the following code into the Scoring code pane of the SAS Code node editor to create a pseudo variable of target predictions:

length P_target 8;

P_target = .;

where target is replaced in both places above with the name of the actual target. If the target is binary or nominal, use the following code instead to create a pseudo variable for the posterior probability of the target event:

length P_targetlevel 8;

P_targetlevel = .;

where target is replaced with the target variable name, and level is replaced with the event level of the target. For example, if the target variable is named BAD with event level 1, the variable name would be P_BAD1. Note that these variable names assume that the resulting name is 32 characters or fewer.
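As a concrete instance of the binary case described above, the Scoring code pane for a target named BAD with event level 1 would contain something like the following. This is a fragment rather than a complete program: it assumes that Model Studio runs it inside the DATA step score code it generates for the SAS Code node.

```sas
/* Scoring code pane of the SAS Code node (fragment, not a full program). */
/* Placeholder posterior probability for target BAD, event level 1.       */
length P_BAD1 8;
P_BAD1 = .;   /* always missing; it only makes the node look like a model */
```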

From the Pipeline Comparison tab, you will then be able to deploy your score code in various ways (register, publish, download), just as you would for a supervised model. This approach also works for deploying score code from other Data Mining Preprocessing nodes, such as Clustering.

acordes · ‎07-28-2020

Thanks Wendy, this looks like the missing value trick which works like a charm in SAS Code or even in VA...

But I am having trouble getting a score derived.

What could I be missing?

My target variable, defined on the Data tab, is called SALES_RES.

But after the SAS Code node is moved to Supervised Learning, P_SALES_RES contains the imputed mean of SALES_RES, and not the score from the Anomaly Detection node.

acordes · ‎07-28-2020

As additional information, I attach the EP score code:

The _SVDDDISTANCE_ is derived correctly.

data sasep.out;
   dcl double "filter_flag";
   dcl package score _654S34YEIM3X7NLUBHJ8PQRK2();
   dcl double "_SVDDDISTANCE_" having label n'SVDD Distance';
   dcl double "_SVDDSCORE_" having label n'SVDD Score';
   dcl double P_SALES_RES;
   dcl double "EM_PREDICTION" having label n'Predicted for SALES_RES';
   varlist allvars [_all_];

   method init();
      _654S34YEIM3X7NLUBHJ8PQRK2.setvars(allvars);
      _654S34YEIM3X7NLUBHJ8PQRK2.setkey(n'28EF307116C3B095C53CFA09EAEDD3F4DA3C5727');
   end;

   method _9KJ0OEHROJTODXMF8WZ9WYRWH();
      if (UPCASE(ORIGIN) = 'DATASET') * (YEAR(RMDFECHAVENTA) >= 2017.0) then
         FILTER_FLAG = 0.0;
      else FILTER_FLAG = 1.0;
   end;

   method _4GPALDPVKQ2PEFHKWWD8T8AMT();
      P_SALES_RES = .;
      if "P_SALES_RES" = . then "P_SALES_RES" = 0.5234044778;
   end;

   method post_4GPALDPVKQ2PEFHKWWD8T8AMT();
      dcl double _P_;

      EM_PREDICTION = "P_SALES_RES";
   end;

   method run();
      set SASEP.IN;
      _9KJ0OEHROJTODXMF8WZ9WYRWH();
      _654S34YEIM3X7NLUBHJ8PQRK2.scoreRecord();
      _4GPALDPVKQ2PEFHKWWD8T8AMT();
      post_4GPALDPVKQ2PEFHKWWD8T8AMT();
   end;

   method term();
   end;

enddata;

WendyCzika · ‎07-29-2020

Are you saying that in your saved data, you aren't seeing the _SVDDSCORE_ column populated with values? It should be applying the analytic store with the EP score code to create that column, though you won't see the code to create it explicitly like for the DATA step score code you see above.

acordes · ‎07-29-2020

I am seeing _SVDDSCORE_ and _SVDDDISTANCE_.

But these scores were already available in the data after the Anomaly Detection node.

I thought that the SAS Code node would apply the scoring code to every observation where all input variables of the Anomaly Detection node are nonmissing.

But that's not the case: P_SALES_RES contains the imputed mean of SALES_RES, whereas I expected to see the distance score there.

Furthermore, my intention is to download the epcode from the Anomaly Detection node, but that's not working either.

acordes · ‎07-29-2020

It worked!!!

I have to get used to not seeing much when looking at the epcode.

The missing link was to apply the score by running PROC ASTORE with the analytic store reference that shows up in the SAS Code node results in the pipeline.

proc astore;
    describe rstore=MODELS._654S34YEIM3X7NLUBHJ8PQRK2_AST
        epcode; 
run;

proc astore;
    score data=RISKNOBA.REMA_FSVBALA_EXT3_PABLO_RVS
        rstore=MODELS._654S34YEIM3X7NLUBHJ8PQRK2_AST
        out=casuser.dedos2;
run;

acordes · ‎07-29-2020

Wendy, one more question.

Can I use this scoring code in a different pipeline?

I suppose I could run this code in a SAS Code node, but I don't know how to update the training data set...

/* SAS code */
proc astore;
score data=&dm_data
rstore=MODELS._654S34YEIM3X7NLUBHJ8PQRK2_AST
out=&dm_data_outmodel; 
run;

WendyCzika · ‎07-29-2020

You could use a Score Data node after the SAS Code node in the same pipeline to score a new data set.

acordes · ‎07-30-2020

Thanks, but I think I did not explain clearly what I want to do.

The scoring code comes from another task, be it another pipeline or code run in SAS Studio.

OK, I could always prepare the data accordingly (applying a score for a global cluster definition) before loading the table into the pipeline.

But if I wanted to achieve this in the current pipeline, how could I proceed? The blog post

https://github.com/sassoftware/sas-viya-dmml-pipelines/blob/master/sas_code_node/class_level_indicat... shows me the way.

But it's so macro-heavy and sophisticated that I cannot adapt it to my needs. There the author creates 0/1-coded dummy variables for the class variables.

I want to apply a score (say, a cluster ID) and use this variable further along in the pipeline. And as far as I know, an epcode cannot be applied line by line. Therefore my idea is to score the training data and merge the score variable from the output table back into the training data table.
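The score-then-merge idea could be sketched roughly as follows, run outside the pipeline (for example in SAS Studio) before the table is loaded into Model Studio. All names here are hypothetical placeholders: the caslib MYCAS, the tables TRAIN, TRAIN_SCORED and TRAIN_WITH_SCORE, the analytic store MODELS.CLUSTER_AST, and the unique key ROW_ID. The output column name _CLUSTER_ID_ is also an assumption; the DESCRIBE statement of PROC ASTORE lists the actual output variables of a given store.

```sas
/* Sketch only: every table, store, and variable name is a placeholder. */

/* 1. Score the training table, keeping the key so rows can be matched. */
proc astore;
    score data=mycas.train
        rstore=models.cluster_ast
        out=mycas.train_scored
        copyvars=(row_id);      /* carry the (assumed unique) key along */
run;

/* 2. Merge the score column back onto the training table by the key. */
data mycas.train_with_score;
    merge mycas.train
          mycas.train_scored(keep=row_id _CLUSTER_ID_);
    by row_id;
run;
```

The merged table can then be used as the project data source, with the score column available as an ordinary input. Whether the same steps can run inside a SAS Code node and update the pipeline's training data is not confirmed in this thread.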