Reinforcement learning (RL) is used in a wide variety of fields. Examples include robotics, industrial automation, dialogue generation, healthcare treatment recommendations, stock trading, and computer games.
SAS Visual Data Mining and Machine Learning has provided batch reinforcement learning capabilities with fitted Q networks (FQNs) for some time. The exciting news is that SAS now provides online, real-time reinforcement learning with deep Q networks (DQNs)!
Reinforcement learning is a type of machine learning. Recall that machine learning includes supervised learning, unsupervised learning, reinforcement learning, and more. Unlike supervised learning, reinforcement learning has no supervisor. Instead, a reward signal serves as the feedback mechanism.
The goal of reinforcement learning is to maximize the long-term reward accumulated over a sequence of actions. This happens through an iterative process of trial and error. Time and order matter in RL: the data are sequential and are not independently and identically distributed.
In reinforcement learning, an agent acts in an environment. Each action earns the agent a positive or negative reward (a reward or a punishment), and the environment transitions to a new state. The agent is then presented with the new state and chooses its next action.
One example is a self-driving car. The car exists in an environment that includes roads and so on. Actions the car may take include moving forward, stopping, turning right, and so on.
The ultimate goal may be for the self-driving car to take me from my house to my favorite restaurant as quickly as possible while following all rules of the road and all safety precautions.
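To make the agent/environment loop concrete, here is a minimal Python sketch of the observe, act, reward, new-state cycle. The Environment and Agent classes and their one-step toy dynamics are hypothetical placeholders for illustration, not part of any SAS API.

```python
# Minimal sketch of the agent/environment loop described above.
# Environment and Agent are hypothetical toy classes, not a SAS API.

class Environment:
    def reset(self):
        """Start a new episode and return the initial state."""
        return 0

    def step(self, action):
        """Apply an action and return (new_state, reward, done)."""
        new_state = action                     # toy dynamics
        reward = 1.0 if action == 1 else -1.0  # toy reward signal
        done = True                            # end after one step
        return new_state, reward, done


class Agent:
    def choose_action(self, state):
        """Pick an action given the current state (trivial policy)."""
        return 1


env, agent = Environment(), Agent()
state = env.reset()
done = False
while not done:
    action = agent.choose_action(state)     # agent acts in the environment
    state, reward, done = env.step(action)  # environment returns reward + new state
```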
Reinforcement learning problems are usually formalized as Markov decision processes; well-known algorithms for solving them include Q-learning and SARSA (state-action-reward-state-action). Reinforcement learning methods may be on-policy, learning the value of the policy the agent is actually following, or off-policy, learning the value of a different (typically the optimal) policy. Q-learning is an example of an off-policy reinforcement learning method.
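To see the on-policy/off-policy difference concretely, compare the standard textbook update targets for the two algorithms, where $\alpha$ is a learning rate and $\gamma$ a discount factor (both discussed further below). SARSA backs up the Q value of the action $a_{t+1}$ the agent actually takes next, while Q-learning backs up the best available action regardless of what the agent does:

$$\text{SARSA: } Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

$$\text{Q-learning: } Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$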
SAS VDMML lets you use two different Q-learning algorithms to accomplish reinforcement learning. Q-learning seeks to learn a policy that maximizes total reward, and it starts with a Q table. A Q table is a matrix of Q values for all possible states and all possible actions. Each cell of the table (each Q value) is initialized to zero. After each episode the Q values are updated and stored. Q stands for quality: high Q values indicate it is a good idea to take a particular action from a particular state, and low Q values indicate it is a bad idea. The Q table becomes a reference table the agent uses to select the best action based on the Q value.
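As a quick sketch of what a Q table looks like in code, the NumPy array below holds one Q value per state/action pair, initialized to zero. The state and action counts are made-up illustrative values.

```python
import numpy as np

# Illustrative sizes; a real problem defines these from its state/action spaces.
n_states, n_actions = 5, 3

# The Q table: one Q value per (state, action) pair, initialized to zero.
q_table = np.zeros((n_states, n_actions))

# Once trained, the agent consults the table: from a given state,
# take the action with the highest Q value.
state = 2
best_action = int(np.argmax(q_table[state]))
```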
Q-learning is an iterative process that occurs in a series of steps. A typical Q-learning process looks like this:

1. Initialize the Q table to zeros.
2. From the current state, choose an action (usually the action with the highest Q value, with occasional random exploration).
3. Perform the action and observe the reward and the new state.
4. Update the Q value for that state/action pair.
5. Repeat steps 2 through 4 until training is complete.
Q-values are updated when action $a_t$ is taken from state $s_t$ using an equation such as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where:

- $Q(s_t, a_t)$ is the current Q value for taking action $a_t$ in state $s_t$
- $\alpha$ is the learning rate
- $r_{t+1}$ is the reward received after taking the action
- $\gamma$ is the discount factor that weights future rewards
- $\max_{a} Q(s_{t+1}, a)$ is the highest Q value attainable from the new state $s_{t+1}$
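Translated directly into code, the update above might look like this minimal sketch. The learning rate and discount factor values are arbitrary choices for illustration.

```python
import numpy as np

alpha = 0.1   # learning rate (arbitrary illustrative value)
gamma = 0.9   # discount factor (arbitrary illustrative value)

def q_update(q_table, state, action, reward, next_state):
    """Apply one Q-learning update to q_table in place."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])

# Example: one update on a small zero-initialized Q table.
q = np.zeros((5, 3))
q_update(q, state=2, action=1, reward=1.0, next_state=3)
print(q[2, 1])  # 0.1 = alpha * (1.0 + gamma * 0 - 0)
```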
Two Q-learning methods are available in SAS VDMML: fitted Q networks (FQNs) and deep Q networks (DQNs).
The deep Q network method was new in SAS VDMML stable release 2020.1.3 and is accomplished through the rlTrainDqn CAS action.
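If you call the action from Python through the SWAT package, the outline looks roughly like the sketch below. The connection details are placeholders, and the action set name is an assumption for illustration; the parameters to pass are deliberately omitted, so consult the SAS documentation for the actual rlTrainDqn signature.

```python
import swat

# Connect to a running CAS server; host, port, and credentials are placeholders.
conn = swat.CAS('my-cas-server.example.com', 5570, 'username', 'password')

# Load the action set that provides rlTrainDqn.
# (Action set name assumed here; verify it against your SAS documentation.)
conn.loadactionset('reinforcementLearn')

# Train a deep Q network. Parameters are omitted because the documented
# signature isn't reproduced here; see the SAS documentation for details.
result = conn.reinforcementLearn.rlTrainDqn()
```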
The new deep Q network algorithm is similar to the fitted Q network algorithm in a couple of ways:

- Both are Q-learning methods that seek a policy maximizing total reward.
- Both replace the literal Q table with a deep neural network that approximates the Q values.

But they are also quite different:

- Fitted Q networks learn in batch from previously collected data, so training happens offline.
- Deep Q networks learn online, updating as the agent interacts with its environment in real time.
The ability to use online, real-time reinforcement learning is a huge benefit of the new deep Q network! To follow an example and create a deep Q network of your own, see Susan Kahler's article.