Supervised learning techniques focus on using labeled data to train classifiers and then using the trained classifiers for predictive modeling. However, in real-world applications such as detecting fraudulent insurance claims, labeled data are usually limited and expensive to obtain.
Moreover, supervised learning algorithms such as k-NN and SVM generally depend on the assumption that nearby points are likely to have the same label, regardless of the underlying data structure. This is an assumption of local consistency, and it can result in inferior classifiers. Points on the same structure (typically referred to as a cluster), however, are likely to have the same label. This assumption, often called the cluster assumption, ensures global consistency. Semi-supervised learning considers both of these assumptions.
Semi-supervised learning techniques combine labeled and unlabeled data to address the challenges posed by supervised learning and to improve classification performance. Semi-supervised learning is used for the same applications as supervised learning, but it trains on both labeled and unlabeled data—typically a small amount of labeled data with a large amount of unlabeled data (because unlabeled data are less expensive and take less effort to acquire). The target variable specifies the variable that contains the label. Input variables are used for distance calculation; only interval variables are supported as inputs.
This type of learning can be used with methods such as classification, regression, and prediction. The goal is to predict the labels of the unlabeled points, and an algorithm's performance is measured by the error rate on these unlabeled points only. Semi-supervised learning is useful when the cost of labeling is too high to allow for a fully labeled training process. An early example is identifying a person's face on a webcam. Semi-supervised learning has numerous applications, including fraud detection, web page classification, image recognition, medical imaging, natural language processing, and action recognition.
Graph-Based Semi-supervised Learning
A principled approach to semi-supervised learning is to design a classifying function that is sufficiently smooth with respect to the intrinsic structure collectively revealed by labeled and unlabeled observations.
The SEMISUPLEARN procedure in SAS Viya implements the graph-based semi-supervised learning algorithm, which is well known for its good performance and scalability for big data. Here the assumption is that the data (both labeled and unlabeled) are embedded within a low-dimensional cluster or manifold that can be reasonably expressed by a graph.
In graph-based methods, the label information of each sample is propagated to its neighboring samples until a globally stable state is reached on the complete data set. PROC SEMISUPLEARN returns predicted labels for the observations in both the unlabeled data table and the labeled data table.
Graph-based semi-supervised learning is a three-step process: build a similarity graph over all observations (labeled and unlabeled), propagate label information across the graph's edges, and assign each unlabeled observation the label that the propagation converges to.
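The three steps can be sketched in a few lines of plain Python. This follows the standard label-propagation formulation (Zhou et al., "Learning with Local and Global Consistency"); PROC SEMISUPLEARN's internals are not public, so the function name, parameters, and details here are illustrative assumptions, not the SAS implementation.

```python
# Minimal sketch of graph-based label propagation (Zhou et al. formulation).
# Illustrative only -- not the PROC SEMISUPLEARN implementation.
import math

def propagate_labels(points, labels, gamma=1.0, alpha=0.9, iters=200):
    """points: list of feature tuples; labels: class index, or None if unlabeled."""
    n = len(points)
    classes = sorted({l for l in labels if l is not None})
    # Step 1: build the affinity (similarity) graph with an RBF kernel.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-gamma * d2)
    # Symmetrically normalize the graph: S = D^{-1/2} W D^{-1/2}.
    deg = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Step 2: propagate labels, F <- alpha*S*F + (1-alpha)*Y, until (nearly) stable.
    y = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(iters):
        f = [[alpha * sum(s[i][k] * f[k][c] for k in range(n)) + (1 - alpha) * y[i][c]
              for c in range(len(classes))] for i in range(n)]
    # Step 3: assign each observation the class with the largest propagated score.
    return [classes[max(range(len(classes)), key=lambda c: f[i][c])] for i in range(n)]

# Two well-separated clusters; only one labeled observation per cluster.
pts = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,)]
lab = [0, None, None, 1, None, None]
print(propagate_labels(pts, lab))  # expect cluster membership: [0, 0, 0, 1, 1, 1]
```

Note how the unlabeled points inherit the label of the cluster they sit in, even though only one observation per cluster was labeled — the cluster assumption at work.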
Model Studio Demo: Detecting Fraud
The semi-supervised learning algorithm is applied to the PaySim data from Lopez-Rojas et al. A graph-based semi-supervised learning model is created using a SAS Code node in a Model Studio pipeline. The PaySimLabeled data set consists of 430 observations, 130 of which are labeled as fraud. The PaySimUnlabeled data set has 5,000 observations, none of which are labeled. The objective is to detect fraud in the unlabeled data.
In the Code Editor window, a CAS session is first started, and all the default libraries are assigned.
options cashost="&dm_cashost" casport=&dm_casport;
cas;
caslib _all_ assign;
The next part of the program loads the labeled and unlabeled SAS data files from a client location into the specified caslib and saves them with the specified table names.
proc casutil;
   load file="/home/PaySimLabeled.sas7bdat" outcaslib="casuser" casout="PaySimLabeled";
   load file="/home/PaySimUnlabeled.sas7bdat" outcaslib="casuser" casout="PaySimUnlabeled";
run;
The following code invokes the SEMISUPLEARN procedure to perform graph-based semi-supervised learning on the unlabeled data table casuser.PaySimUnlabeled and the labeled data table casuser.PaySimLabeled. The INPUT statement specifies the interval variables to be used as inputs. The OUTPUT statement requests that the predicted labels for the unlabeled and labeled tables be written to the data table casuser.PaySimOut.
proc semisuplearn data=casuser.PaySimUnlabeled label=casuser.PaySimLabeled gamma=1000;
   input %dm_interval_input;
   output out=casuser.PaySimOut copyvar=(_all_);
   target %dm_dec_target;
run;
By default, the radial basis function (Gaussian) kernel is used to calculate similarity when computing the pairwise distance between observations. The GAMMA= option specifies the inverse of the variance of the Gaussian kernel—essentially the RBF kernel width—and is set here to 1000.
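As a rough illustration of what GAMMA controls (this is generic RBF-kernel math, hypothetical function names, not SAS source code): the kernel similarity between two observations is exp(-gamma × squared distance), so a larger gamma makes similarity fall off much more sharply with distance, producing a sparser, more local graph.

```python
# Illustrative sketch of the RBF kernel that GAMMA tunes (not SAS code).
import math

def rbf_similarity(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

a, b = (0.00, 0.00), (0.05, 0.00)        # two nearby observations
print(rbf_similarity(a, b, gamma=1))     # ~0.9975: strongly connected
print(rbf_similarity(a, b, gamma=1000))  # ~0.082: weakly connected
```

With gamma=1000 even these two close points are only weakly connected in the graph, which is why tuning GAMMA against the reported loss matters.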
The following statements sort the output of PROC SEMISUPLEARN by id and show the observations from 100 to 200:
data PaySimOut;
set casuser.PaySimOut;
run;
proc sort data=PaySimOut;
by %dm_id;
run;
proc print data=PaySimOut(firstobs=100 obs=200);
run;
The successful run of the SAS Code node results in the following:
The Model Information table shows the number of unlabeled observations, number of labeled observations, number of levels for the target variable, gamma value, maximum number of iterations, the kernel used in the computation, number of nearest neighbors, and the loss.
You can try experimenting with GAMMA values to minimize the loss.
The PAYSIMOUT table is printed.
The table shows the values of the input and ID variables, the predicted labels for the printed observations, and an indicator of whether each row comes from the labeled or unlabeled data. The _WARN_ column carries this indicator: its value is 1 for labeled data and 0 for unlabeled data. I_isFraud is the predicted target variable, which is generated for both labeled and unlabeled samples.
Reference (first paper on the PaySim simulator): Lopez-Rojas, E. A., A. Elmir, and S. Axelsson. 2016. "PaySim: A Financial Mobile Money Simulator for Fraud Detection." In Proceedings of the 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus.
Find more articles from SAS Global Enablement and Learning here.
I guess a use case might be in the insurance sector where fraud is often hidden, though there are probably cases where an investigation has taken place and a result has been found of fraud or not-fraud.
Anyone used this technique in other applications?
Colin