Each state can be represented by a number or a vector.
A state is continuous if it includes continuous-valued numbers like position, velocity, or angle.
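As a rough sketch (the variable names and numbers below are illustrative, not from the course), a discrete state can be stored as a single integer, while a continuous state can be stored as a vector of floats:

```python
import numpy as np

# Discrete state: a single number identifying which position the machine is in.
discrete_state = 3

# Continuous state: a vector of continuous-valued numbers,
# e.g. x position, y position, velocity, and angle (illustrative values).
continuous_state = np.array([12.4, 0.7, -1.3, 0.05])
```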
Reward
A reward function outputs a number associated with each state; the reward can be either positive or negative.
Example
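For instance, a minimal reward-function sketch for the six-state Mars Rover setup used later in these notes (the reward values 100 and 40 are assumed from that example; the function name is illustrative):

```python
def reward(state: int) -> float:
    """Reward associated with a state; it can be positive, zero, or negative."""
    terminal_rewards = {1: 100.0, 6: 40.0}   # desirable terminal states
    # A bad outcome (e.g. crashing) could instead map to a negative number.
    return terminal_rewards.get(state, 0.0)  # every other state gives 0

print(reward(1), reward(3))   # 100.0 0.0
```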
Return
(Source)
The system also evaluates the result of the machine's journey through the return, the sum of rewards along the way weighted by increasing powers of a discount factor $\gamma$ (a number between 0 and 1):

$$\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$$
For positive rewards, lowering the discount factor $\gamma$ discounts future rewards more heavily, pushing the machine to collect them as early as possible; the same discounting also incentivizes the system to delay negative rewards as long as possible (e.g. postponing a loan payment).
Examples
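As a sketch (the function name and the sample reward sequence are made up for illustration), the return is just the sum of rewards weighted by growing powers of the discount factor:

```python
def discounted_return(rewards, gamma=0.5):
    """Return = R1 + gamma*R2 + gamma^2*R3 + ... over one journey."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A journey whose rewards are 0, 0, 0, 100 with gamma = 0.5:
print(discounted_return([0, 0, 0, 100]))   # 0 + 0 + 0 + 0.125 * 100 = 12.5
```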
Policy
(Source) (Refinement)
A function $\pi$ that tells the machine what action $a = \pi(s)$ to take in every state $s$
Greedy/Exploitation: with probability 0.95, choose the action that maximizes the Q-function
Exploration: with probability 0.05, choose a random action
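Together these two rules form an ε-greedy policy with ε = 0.05; a minimal sketch, assuming Q is stored as a (num_states × num_actions) NumPy table (names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.05, rng=None):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore."""
    rng = rng or np.random.default_rng()
    num_actions = Q.shape[1]                    # Q has shape (num_states, num_actions)
    if rng.random() < epsilon:                  # exploration
        return int(rng.integers(num_actions))   # random action
    return int(np.argmax(Q[state]))             # exploitation: best known action
```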
Greedy vs. Exploration Tradeoff
If the machine only ever chooses the best known actions (greedy), it never actually learns which actions are bad and should be avoided. The small exploration probability lets it try new actions, even suboptimal ones, so it can expand its learning and avoid those bad actions in the future.
State-action value function (Q-function)
(ML Spec)
A function $Q(s, a)$ of state $s$ and action $a$, which outputs the Return if the machine:
starts in state $s$
takes the action $a$ only once
behaves optimally after that, according to the Policy. This implies that $a$ itself might not be optimal, but all subsequent actions should be.
If we can compute $Q(s, a)$ for every state and action, it helps us pick the best action at each state, $a = \arg\max_a Q(s, a)$, which automatically gives us the Policy.
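A sketch of how a Q-table turns into a policy (the table values below are placeholders): the policy for a state is simply the action with the largest Q-value.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions (0 = left, 1 = right).
Q = np.array([[50.0, 12.5],
              [25.0,  6.25]])

def policy(state: int) -> int:
    """pi(s) = argmax over a of Q(s, a)."""
    return int(np.argmax(Q[state]))

print(policy(0))   # 0, i.e. "left" is the best action in this made-up table
```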
Bellman Equation
(ML Spec)
Given the current state $s$, the action $a$, the next state $s'$, and the next action $a'$, $Q(s, a)$ is computed as:

$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$

Note that $\max_{a'} Q(s', a')$ is the best possible return from state $s'$, which captures the discounted sum of rewards for all states after $s$.
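As a direct translation into code, assuming a tabular Q indexed by (state, action) pairs and deterministic transitions (the function and argument names are illustrative):

```python
def bellman_q(state, next_state, Q, R, actions, gamma=0.5):
    """Q(s, a) = R(s) + gamma * max over a' of Q(s', a'),
    where next_state is the state s' reached by taking action a in state s."""
    return R(state) + gamma * max(Q[(next_state, a_next)] for a_next in actions)
```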
Intuition
The image illustrates a Mars Rover that can move either left or right (the possible actions) to collect rewards at different states; the yellow arrows indicate the optimal policy. In this example the leftmost state (state 1) has a reward of 100, the rightmost state (state 6) has a reward of 40, and every state in between has a reward of 0.
If it starts from state 2 and goes right (which is not optimal), then behaves optimally afterwards, its Q-function (with discount factor $\gamma = 0.5$) is:

$$Q(2, \rightarrow) = 0 + (0.5)(0) + (0.5)^2(0) + (0.5)^3(100) = 12.5$$

Similarly, if it goes left:

$$Q(2, \leftarrow) = 0 + (0.5)(100) = 50$$
Repeat this for every combination of state and action to obtain the full Q-function:
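The whole table can be reproduced with a small value-iteration loop that repeatedly applies the Bellman equation to the six-state rover; a sketch assuming the rewards above (100 at state 1, 40 at state 6, both terminal):

```python
def mars_rover_q(gamma=0.5, n_states=6, sweeps=50):
    """Fill the Q-table for the 1-D rover by repeatedly applying the Bellman equation."""
    terminal_rewards = {1: 100.0, n_states: 40.0}
    actions = ("left", "right")
    Q = {(s, a): 0.0 for s in range(1, n_states + 1) for a in actions}

    def R(s):
        return terminal_rewards.get(s, 0.0)

    for _ in range(sweeps):
        for s in range(1, n_states + 1):
            for a in actions:
                if s in terminal_rewards:        # terminal state: no future rewards
                    Q[(s, a)] = R(s)
                    continue
                s_next = s - 1 if a == "left" else s + 1
                Q[(s, a)] = R(s) + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return Q

Q = mars_rover_q()
print(Q[(2, "right")], Q[(2, "left")])   # 12.5 50.0
```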