Understanding Reinforcement Learning: A High-Level Overview

Background & Motivation

  • Reading Reinforcement Learning: An Introduction by Sutton and Barto.
  • Book is dense but incredibly rewarding—complex ideas are simplified into core principles.
  • Writing this post to summarize and solidify understanding of Part 1 (core RL concepts).
  • Not a replacement for the book, just a high-level overview.

What is Reinforcement Learning?

  • Involves an agent interacting with an environment over time.
  • Agent takes actions, receives rewards, and transitions between states.
  • Objective: maximize cumulative reward (the return) over an episode.
  • Formally modeled as a Markov Decision Process (MDP).
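
In code, one episode of this interaction loop looks roughly like the sketch below. Note that `env` (with Gym-style `reset`/`step` methods) and `policy` are hypothetical placeholders, not a specific library:

```python
# One episode of the agent-environment loop. `env` (with Gym-style
# reset/step methods) and `policy` are hypothetical placeholders.

def run_episode(env, policy):
    """Interact until the episode ends; return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
    return total_reward
```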

Policies: The Agent’s Strategy

  • A policy defines the agent's behavior—how it picks actions in each state.
  • Can be:
    • Deterministic (e.g., always move left).
    • Stochastic (e.g., 25% chance of each direction).
  • Better policies lead to higher total rewards.
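
A quick sketch of both kinds of policy in Python; the action names are illustrative and match the maze example below:

```python
import random

ACTIONS = ["up", "down", "left", "right"]

# Deterministic: the same state always yields the same action.
def deterministic_policy(state):
    return "left"

# Stochastic: actions are sampled from a distribution
# (here uniform, i.e., a 25% chance of each direction).
def stochastic_policy(state):
    return random.choice(ACTIONS)
```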

Example: Agent in a Maze

  • Maze = 2D grid; each cell is a state.
  • Actions: up, down, left, right.
  • Rewards:
    • -1 for each step.
    • +1 for reaching the goal (terminal state).
    • Hitting a wall → the agent stays in the same state and still receives -1.
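
Here is a minimal sketch of such a maze environment. The 4×4 grid size, wall cells, and goal position are illustrative assumptions, not from the book:

```python
# A sketch of the maze environment described above. The 4x4 grid,
# wall cells, and goal position are illustrative assumptions.

GRID_SIZE = 4
GOAL = (3, 3)                     # terminal state
WALLS = {(1, 1), (2, 1)}          # cells the agent cannot enter
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for one move in the maze."""
    row, col = state
    drow, dcol = MOVES[action]
    next_state = (row + drow, col + dcol)
    # Hitting a wall or the grid edge leaves the agent in place (still -1).
    if (next_state in WALLS
            or not 0 <= next_state[0] < GRID_SIZE
            or not 0 <= next_state[1] < GRID_SIZE):
        next_state = state
    if next_state == GOAL:
        return next_state, +1.0, True   # reaching the goal ends the episode
    return next_state, -1.0, False      # every other step costs -1
```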

Value Functions: How Good is a Policy?

  • A value function estimates expected future reward from each state under a given policy.
  • Key for comparing policies.
  • In the maze:
    • Value of a state = how close it is (in terms of reward) to the goal if you follow the policy.
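
Formally, in the book's notation, the value of state s under policy π is the expected discounted return from s, where G_t is the return and γ ∈ [0, 1] is the discount factor:

$$
v_\pi(s) = \mathbb{E}_\pi\big[\,G_t \mid S_t = s\,\big]
         = \mathbb{E}_\pi\Big[\,\textstyle\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\Big|\; S_t = s\,\Big]
$$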

Two Ways to Evaluate Policies

  • Solve the Bellman equations directly: for a fixed policy they form a system of linear equations of the form Av + b = 0, where v is the vector of state values.
  • Iterative Policy Evaluation (preferred for large problems):
    • Estimate values over multiple passes through the states; see the sketch below.
    • More computationally feasible than solving the linear system.
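
A minimal sketch of iterative policy evaluation, assuming a deterministic-transition problem like the maze. The helpers `states` (non-terminal states only), `actions(s)`, `transition(s, a)` returning `(next_state, reward)`, and `policy_prob(s, a)` are hypothetical, and this is not the book's exact pseudocode:

```python
# Iterative policy evaluation (a sketch, not the book's exact pseudocode).
# Assumed hypothetical helpers: `states` lists the non-terminal states,
# `actions(s)` lists legal actions, `transition(s, a)` deterministically
# returns (next_state, reward), and `policy_prob(s, a)` is the policy's
# probability of choosing a in s.

def evaluate_policy(states, actions, transition, policy_prob,
                    gamma=0.9, theta=1e-6):
    """Sweep the state space until value estimates stop changing.

    gamma < 1 keeps evaluation convergent even for policies that
    never reach the goal.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions(s):
                next_s, reward = transition(s, a)
                # Bellman expectation backup (deterministic transitions).
                # Terminal states are not in V, so they contribute value 0.
                v_new += policy_prob(s, a) * (reward + gamma * V.get(next_s, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # largest change this sweep is negligible: done
            return V
```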

Policy Improvement & Optimization

  • Once a policy is evaluated, improve it: in each state, switch to the action that leads to the highest-value successor state.
  • Repeat:
    • Evaluate policy.
    • Improve policy.
    • Repeat until convergence (i.e., policy stops changing).
  • Result: an optimal policy (there may be several equally good ones); policy iteration is sketched below.
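
Putting the two steps together gives policy iteration. This sketch reuses the hypothetical helpers from the evaluation sketch above:

```python
# Policy iteration: alternate evaluation and greedy improvement until the
# policy stops changing. Reuses the hypothetical helpers from the
# evaluation sketch above.

def improve_policy(states, actions, transition, V, gamma=0.9):
    """Return the policy that acts greedily with respect to V."""
    policy = {}
    for s in states:
        def one_step_value(a):
            next_s, reward = transition(s, a)
            return reward + gamma * V.get(next_s, 0.0)  # terminals are worth 0
        # Pick the action whose one-step lookahead value is highest.
        policy[s] = max(actions(s), key=one_step_value)
    return policy

def policy_iteration(states, actions, transition, initial_policy, gamma=0.9):
    """Evaluate, improve, repeat until the policy is stable."""
    policy = dict(initial_policy)
    while True:
        # A deterministic policy puts probability 1 on its chosen action.
        prob = lambda s, a: 1.0 if policy[s] == a else 0.0
        V = evaluate_policy(states, actions, transition, prob, gamma)
        new_policy = improve_policy(states, actions, transition, V, gamma)
        if new_policy == policy:  # unchanged policy means convergence
            return policy, V
        policy = new_policy
```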

The Core Loop of Reinforcement Learning

  • Define the problem: Set up states, actions, and rewards.
  • Evaluate a policy: Understand how good the current strategy is.
  • Improve the policy: Make smarter decisions based on value estimates.
  • Repeat until convergence.

Final Thoughts

  • RL is powerful because of its simplicity: learn by interacting.
  • Sutton & Barto distill it into fundamental ideas: elegant, even if the math gets heavy.
  • At its core: define, evaluate, improve, repeat.