(Intro) (Returns)

Reinforcement learning is one pillar of machine learning. Compared to supervised learning, it tells the machine WHAT to do instead of HOW to do it, through a reinforcement system: rewarding desirable actions (positive reinforcement) and punishing undesirable ones (negative reinforcement).

Summary

Applications

  • controlling autonomous robots/vehicles
  • optimizing factory operations
  • trading stocks
  • playing games

Process

  1. Define
    1. variables for the state s (see the sketch after this list)
    2. the actions a
  2. State the problem
  3. Define a reward function for each possible situation
  4. Learn the state-action value function Q(s, a) with Deep Reinforcement Learning
  5. Compute the policy
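A minimal sketch of step 1 for a hypothetical one-dimensional rover (the state values and actions here are made up, just to make the step concrete):

  import numpy as np

  # Step 1: define variables for the state s and the actions a
  # (hypothetical rover that sits in one of 6 cells and can move left or right)
  states = np.arange(6)        # positions 0..5 along a line
  actions = [-1, +1]           # move one cell left / one cell right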

Formalism

Reinforcement learning tells the machine WHAT to do next based only on its current state and the reward there (example)

This is called a Markov Decision Process (MDP): the future depends only on the current state, not on how the machine got there

State

Each state could be represented by a number or a vector. A state is continuous if it includes continuous-valued numbers like position, velocity, or angle.
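For example (values made up for illustration), a discrete state can be a single number, while a continuous state is a vector of real-valued quantities:

  import numpy as np

  discrete_state = 3                                    # e.g. "the rover is in cell 3"
  continuous_state = np.array([1.2, -0.4, 0.05, 0.8])   # e.g. position, velocity, angle, angular velocity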

Reward

A reward function outputs a number associated with a state, so it can be either positive or negative.
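A reward function in code is just a mapping from state to a number; a tiny sketch with made-up states and values:

  def reward(state):
      if state == "goal":
          return 100    # desirable outcome: positive reward
      if state == "crash":
          return -100   # undesirable outcome: negative reward
      return 0          # everything else: neutral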

Return

(Source) The system also evaluates the result of the machine’s journey with the return, the discounted sum of the rewards it collects:

Return = R1 + γ*R2 + γ^2*R3 + ...   (γ is the discount factor, typically a number slightly less than 1)

  • The first reward is NOT discounted (it is multiplied by γ^0 = 1)
  • Rewards further in the future are discounted more heavily. This makes the system prefer collecting positive rewards sooner, but it also incentivizes the system to delay negative rewards as long as possible (e.g. paying loans); see the sketch after this list
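A minimal sketch of the return computation, with an assumed discount factor of 0.9 and a made-up list of rewards:

  def discounted_return(rewards, gamma=0.9):
      # Return = R1 + gamma*R2 + gamma^2*R3 + ...
      # the first reward is multiplied by gamma^0 = 1, i.e. not discounted
      return sum(gamma**t * r for t, r in enumerate(rewards))

  print(discounted_return([0, 0, 0, 100]))   # 0.9^3 * 100 = 72.9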

Policy

(Source) (Refinement) A function π(s) that tells the machine what action a to take in every state s.

  • Greedy/Exploitation: with probability 0.95, choose the action a that maximizes the Q-function
  • Exploration: with probability 0.05 (often called ε), choose an action at random

The small exploration probability lets the machine try actions that currently look suboptimal, so that it can learn from them and avoid them in the future if they really are bad. A sketch of this ε-greedy rule follows.
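A minimal sketch, assuming Q is a lookup table (a dict from (state, action) pairs to values); the table entries below are hypothetical:

  import random

  def epsilon_greedy(Q, state, actions, epsilon=0.05):
      if random.random() < epsilon:
          return random.choice(actions)                  # exploration: random action
      return max(actions, key=lambda a: Q[(state, a)])   # exploitation: argmax_a Q(s, a)

  Q = {(0, -1): 1.0, (0, +1): 2.5}
  print(epsilon_greedy(Q, state=0, actions=[-1, +1]))    # usually +1 (exploitation), occasionally -1 (exploration)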

State-action value function (Q-function)

(ML Spec) A function Q(s, a) of state s and action a, which outputs the Return if the machine:

  • starts in state s
  • takes the action a only once
  • behaves optimally after that, according to the Policy. This implies that a itself might not be optimal, but all subsequent actions should be. If we can compute Q(s, a) for every state and action, we can pick the best action at each state and obtain the Policy automatically (see the sketch after this list)
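If Q(s, a) were known for every state–action pair, computing the policy is just an argmax over actions; a sketch with hypothetical Q-values:

  states, actions = [0, 1], [-1, +1]
  Q = {(0, -1): 12.5, (0, +1): 10.0,    # hypothetical Q(s, a) values
       (1, -1):  9.0, (1, +1): 11.0}

  # the policy picks, in every state, the action with the highest Q-value
  policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
  print(policy)   # {0: -1, 1: 1}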

Bellman Equation

(ML Spec) Given the current state s, the action a, the next state s' and the next action a', Q(s, a) is computed as:

Q(s, a) = R(s) + γ * max_a' Q(s', a')

Note that max_a' Q(s', a') is the best possible return from state s', i.e. the discounted sum of the rewards for all states after s.
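A sketch of computing the right-hand side of the Bellman equation for one (s, a) pair; the transition, rewards, and γ = 0.9 are assumed values:

  gamma = 0.9   # assumed discount factor

  def bellman_target(reward_s, Q, s_next, actions):
      # Q(s, a) = R(s) + gamma * max_a' Q(s', a')
      return reward_s + gamma * max(Q[(s_next, a_next)] for a_next in actions)

  actions = [-1, +1]
  Q = {(1, -1): 4.0, (1, +1): 6.0}      # hypothetical values for the next state s' = 1
  print(bellman_target(2.0, Q, s_next=1, actions=actions))   # 2.0 + 0.9 * 6.0 = 7.4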

Deep Reinforcement Learning

Mini-batch
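In the ML Spec's deep Q-learning setting, mini-batch training means each update uses a small random subset of stored experiences rather than the whole replay buffer; a sketch (the buffer contents and batch size are assumptions):

  import random

  replay_buffer = [((0,), -1, 0.0, (1,)),    # hypothetical (state, action, reward, next_state) tuples
                   ((1,), +1, 1.0, (2,)),
                   ((2,), +1, 0.0, (3,))]

  def sample_minibatch(buffer, batch_size=2):
      # train on a small random subset instead of the whole buffer
      return random.sample(buffer, min(batch_size, len(buffer)))

  print(sample_minibatch(replay_buffer))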

Soft updates
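Likewise, a soft update (as used in the ML Spec's deep Q-learning setting) blends a small fraction of the newly learned Q-network weights into the target network instead of overwriting it; a sketch with an assumed blending factor τ:

  import numpy as np

  tau = 0.01   # assumed small blending factor

  def soft_update(target_weights, new_weights, tau=tau):
      # target <- tau * new + (1 - tau) * target, applied to each weight array
      return [tau * w_new + (1.0 - tau) * w_tgt
              for w_new, w_tgt in zip(new_weights, target_weights)]

  target = [np.zeros((2, 2))]
  new = [np.ones((2, 2))]
  print(soft_update(target, new))   # each entry moves 1% of the way toward the new weights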