Each state can be represented by a number or a vector.
A state is continuous if it includes continuous-valued numbers like position, velocity, or angle.
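As a rough sketch (the variable names and numbers below are illustrative, not from the course), a discrete state can be stored as a single integer, while a continuous state can be stored as a vector of floats:

```python
import numpy as np

# Discrete state: a single number identifying which position the machine is in.
discrete_state = 3

# Continuous state: a vector of continuous-valued numbers,
# e.g. x position, y position, velocity, and angle (illustrative values).
continuous_state = np.array([12.4, 0.7, -1.3, 0.05])
```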
Reward
A reward function outputs a number associated with each state; the reward can be either positive or negative.
Example
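For instance, a minimal reward-function sketch for the six-state Mars Rover setup used later in these notes (the reward values 100 and 40 are assumed from that example; the function name is illustrative):

```python
def reward(state: int) -> float:
    """Reward associated with a state; it can be positive, zero, or negative."""
    terminal_rewards = {1: 100.0, 6: 40.0}   # desirable terminal states
    # A bad outcome (e.g. crashing) could instead map to a negative number.
    return terminal_rewards.get(state, 0.0)  # every other state gives 0

print(reward(1), reward(3))   # 100.0 0.0
```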
Return
(Source)
The system also evaluates the result of the machine's journey through the return, the sum of rewards along the way weighted by increasing powers of a discount factor $\gamma$ (a number between 0 and 1):

$$\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$$
For positive rewards, lowering the discount factor $\gamma$ discounts future rewards more heavily, pushing the machine to collect them as early as possible; the same discounting also incentivizes the system to delay negative rewards as long as possible (e.g. postponing a loan payment).
Examples
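As a sketch (the function name and the sample reward sequence are made up for illustration), the return is just the sum of rewards weighted by growing powers of the discount factor:

```python
def discounted_return(rewards, gamma=0.5):
    """Return = R1 + gamma*R2 + gamma^2*R3 + ... over one journey."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A journey whose rewards are 0, 0, 0, 100 with gamma = 0.5:
print(discounted_return([0, 0, 0, 100]))   # 0 + 0 + 0 + 0.125 * 100 = 12.5
```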
Policy
(Source) (Refinement)
A function $\pi$ that tells the machine what action $a = \pi(s)$ to take in every state $s$
Greedy/Exploitation: with probability 0.95, choose the action that maximizes the Q-function
Exploration: with probability 0.05, choose a random action
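Together these two rules form an ε-greedy policy with ε = 0.05; a minimal sketch, assuming Q is stored as a (num_states × num_actions) NumPy table (names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.05, rng=None):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore."""
    rng = rng or np.random.default_rng()
    num_actions = Q.shape[1]                    # Q has shape (num_states, num_actions)
    if rng.random() < epsilon:                  # exploration
        return int(rng.integers(num_actions))   # random action
    return int(np.argmax(Q[state]))             # exploitation: best known action
```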
Greedy vs. Exploration Tradeoff
If the machine only ever chooses the best known actions (greedy), it never actually learns which actions are bad and should be avoided. The small exploration probability lets it try new actions, even suboptimal ones, so it can expand its learning and avoid those bad actions in the future.
State-action value function (Q-function)
(ML Spec)
A function $Q(s, a)$ of state $s$ and action $a$, which outputs the Return if the machine:
starts in state $s$
takes the action $a$ only once
behaves optimally after that, according to the Policy. This implies that $a$ itself might not be optimal, but all subsequent actions should be.
If we can compute $Q(s, a)$ for every state and action, it helps us pick the best action at each state, $a = \arg\max_a Q(s, a)$, which automatically gives us the Policy.
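A sketch of how a Q-table turns into a policy (the table values below are placeholders): the policy for a state is simply the action with the largest Q-value.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions (0 = left, 1 = right).
Q = np.array([[50.0, 12.5],
              [25.0,  6.25]])

def policy(state: int) -> int:
    """pi(s) = argmax over a of Q(s, a)."""
    return int(np.argmax(Q[state]))

print(policy(0))   # 0, i.e. "left" is the best action in this made-up table
```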
Bellman Equation
(ML Spec)
Given the current state $s$, the action $a$, the next state $s'$, and the next action $a'$, $Q(s, a)$ is computed as:

$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$

Note that $\max_{a'} Q(s', a')$ is the best possible return from state $s'$, which captures the discounted sum of rewards for all states after $s$.
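As a direct translation into code, assuming a tabular Q indexed by (state, action) pairs and deterministic transitions (the function and argument names are illustrative):

```python
def bellman_q(state, next_state, Q, R, actions, gamma=0.5):
    """Q(s, a) = R(s) + gamma * max over a' of Q(s', a'),
    where next_state is the state s' reached by taking action a in state s."""
    return R(state) + gamma * max(Q[(next_state, a_next)] for a_next in actions)
```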
Intuition
The image illustrates a Mars Rover that can move either left or right (the possible actions) to collect rewards at different states; the yellow arrows indicate the optimal policy. In this example the leftmost state (state 1) has a reward of 100, the rightmost state (state 6) has a reward of 40, and every state in between has a reward of 0.
If it starts from state 2 and goes right (which is not optimal), then behaves optimally afterwards, its Q-function (with discount factor $\gamma = 0.5$) is:

$$Q(2, \rightarrow) = 0 + (0.5)(0) + (0.5)^2(0) + (0.5)^3(100) = 12.5$$

Similarly, if it goes left:

$$Q(2, \leftarrow) = 0 + (0.5)(100) = 50$$
Repeat this for every combination of state and action to obtain the full Q-function:
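The whole table can be reproduced with a small value-iteration loop that repeatedly applies the Bellman equation to the six-state rover; a sketch assuming the rewards above (100 at state 1, 40 at state 6, both terminal):

```python
def mars_rover_q(gamma=0.5, n_states=6, sweeps=50):
    """Fill the Q-table for the 1-D rover by repeatedly applying the Bellman equation."""
    terminal_rewards = {1: 100.0, n_states: 40.0}
    actions = ("left", "right")
    Q = {(s, a): 0.0 for s in range(1, n_states + 1) for a in actions}

    def R(s):
        return terminal_rewards.get(s, 0.0)

    for _ in range(sweeps):
        for s in range(1, n_states + 1):
            for a in actions:
                if s in terminal_rewards:        # terminal state: no future rewards
                    Q[(s, a)] = R(s)
                    continue
                s_next = s - 1 if a == "left" else s + 1
                Q[(s, a)] = R(s) + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return Q

Q = mars_rover_q()
print(Q[(2, "right")], Q[(2, "left")])   # 12.5 50.0
```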