Reinforcement Learning Tutorial

Introduction

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards. The agent aims to maximize the cumulative reward over time. Unlike supervised learning, RL does not require labeled input/output pairs and instead learns from the consequences of its actions.
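
To make this action-reward loop concrete, here is a minimal Python sketch of a single episode. The env.reset, env.step, agent.act, and agent.learn methods are hypothetical placeholders used only to illustrate the cycle, not the API of any particular library:

def run_episode(env, agent):
    # One episode: the agent acts, the environment returns a reward and the
    # next state, and the agent updates itself from that experience.
    state = env.reset()                                 # initial state s
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                       # choose an action a
        next_state, reward, done = env.step(action)     # receive reward r and next state s'
        agent.learn(state, action, reward, next_state)  # learn from (s, a, r, s')
        total_reward += reward
        state = next_state
    return total_reward                                 # cumulative reward the agent tries to maximize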

Key Concepts

There are several key concepts in reinforcement learning (a short code sketch follows the list):

  • Agent: The learner or decision maker.
  • Environment: The external system the agent interacts with.
  • State (s): A representation of the current situation of the agent.
  • Action (a): Choices made by the agent.
  • Reward (r): Feedback received after performing an action.
  • Policy (π): Strategy used by the agent to determine the next action based on the current state.
  • Value Function (V): Expected cumulative reward from a state.
  • Q-Function (Q): Expected cumulative reward from a state-action pair.
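
As a rough illustration (not tied to any particular library), the concepts above map onto simple Python types: a policy is a mapping from states to actions, and the value and Q-functions can be stored as dictionaries. The states, actions, and numbers below are arbitrary placeholders:

from typing import Dict, Tuple

State = str
Action = str

# Policy pi: which action to take in each state (placeholder values)
policy: Dict[State, Action] = {"A": "right", "B": "left"}

# Value function V(s): expected cumulative reward starting from state s
V: Dict[State, float] = {"A": 1.5, "B": 0.8}

# Q-function Q(s, a): expected cumulative reward from taking action a in state s
Q: Dict[Tuple[State, Action], float] = {("A", "right"): 1.5, ("A", "left"): 0.4}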

Markov Decision Process (MDP)

An MDP is a mathematical framework for modeling decision-making. It formalizes the RL problem in terms of states, actions, rewards, and transitions. The key components of an MDP, illustrated with a small code sketch after the list, are:

  • State Space (S): Set of all possible states.
  • Action Space (A): Set of all possible actions.
  • Transition Probability (P): Probability of moving from one state to another, given an action.
  • Reward Function (R): Reward received after transitioning from one state to another, given an action.
  • Discount Factor (γ): Represents the importance of future rewards.
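
As a concrete toy illustration, these components can be written down as plain Python data. The two states, two actions, probabilities, and rewards below are made-up values (and, for simplicity, the reward here depends only on the state-action pair):

# State space S and action space A
states = ["s0", "s1"]
actions = ["stay", "go"]

# Transition probabilities P(s' | s, a), stored as {(s, a): {s': probability}}
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# Reward function R(s, a): reward received for taking action a in state s
R = {
    ("s0", "go"): 1.0,
    ("s1", "go"): 0.0,
}

# Discount factor: how much future rewards count relative to immediate ones
gamma = 0.9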

Exploration vs Exploitation

One of the fundamental challenges in RL is balancing exploration (trying new actions) and exploitation (choosing known actions that yield high rewards). Various strategies can be used to balance this trade-off; two common ones, sketched in code after the list, are:

  • ε-Greedy: With probability ε, choose a random action (explore), and with probability 1-ε, choose the best-known action (exploit).
  • Softmax: Select actions based on a probability distribution that emphasizes actions with higher estimated rewards.
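
Here is a minimal sketch of both strategies, operating on an array of estimated action values (q_values). The function names and the temperature parameter are illustrative, not a standard API:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore (random action),
    # otherwise exploit the best-known action.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    # Higher-valued actions get exponentially higher selection probability;
    # the temperature controls how peaked the distribution is.
    prefs = np.array(q_values, dtype=float) / temperature
    prefs -= prefs.max()                         # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()  # probability distribution over actions
    return int(np.random.choice(len(q_values), p=probs))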

Q-Learning

Q-Learning is a popular RL algorithm that aims to learn the optimal policy by updating the Q-values iteratively. The Q-value represents the expected cumulative reward of taking an action in a given state. The Q-learning update rule is:

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]

Where:

  • α: Learning rate
  • γ: Discount factor
  • r: Reward received after taking action a in state s
  • s': Next state
  • max_a' Q(s', a'): Maximum Q-value over all actions a' in the next state s'
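
As a quick numeric illustration with made-up values: suppose α = 0.1, γ = 0.9, the current estimate is Q(s, a) = 0, the observed reward is r = 2, and max_a' Q(s', a') = 3. The update gives

Q(s, a) ← 0 + 0.1 × (2 + 0.9 × 3 - 0) = 0.1 × 4.7 = 0.47

so the estimate moves a fraction α of the way toward the new target r + γ max_a' Q(s', a').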

Example: Q-Learning in Python

Here is a simple example of Q-Learning in Python using a small four-state chain environment, where the agent starts in state A and learns by moving left or right until it reaches the terminal state D:


import numpy as np

# Define the environment: four states in a chain (A - B - C - D) where the
# agent can move left or right; D is the terminal (goal) state.
states = ["A", "B", "C", "D"]
actions = ["left", "right"]
rewards = {
    ("A", "right"): 1,
    ("B", "right"): 2,
    ("C", "right"): 3,
    ("D", "right"): 4,
}
transitions = {
    "A": {"left": "A", "right": "B"},
    "B": {"left": "A", "right": "C"},
    "C": {"left": "B", "right": "D"},
    "D": {"left": "C", "right": "D"},
}

# Initialize the Q-table: one row per state, one column per action
Q = np.zeros((len(states), len(actions)))

# Hyperparameters
alpha = 0.1     # learning rate
gamma = 0.9     # discount factor
epsilon = 0.1   # exploration rate
episodes = 1000

for episode in range(episodes):
    state = "A"
    while state != "D":  # run each episode until the terminal state is reached
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = np.random.choice(actions)                    # explore
        else:
            action = actions[np.argmax(Q[states.index(state)])]   # exploit

        next_state = transitions[state][action]
        reward = rewards.get((state, action), 0)

        # Q-learning update rule
        s, a = states.index(state), actions.index(action)
        s_next = states.index(next_state)
        Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])

        state = next_state

print("Q-values:")
print(Q)

Sample output (the exact Q-values vary from run to run, because action selection is partly random):

Q-values:
[[0.9  1.71]
 [0.   2.9 ]
 [0.   3.9 ]
 [0.   0.  ]]
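
One common next step (not shown in the script above) is to read the greedy policy off the learned Q-table by taking the best action in each row:

# Derive the greedy policy from the learned Q-table
policy = {s: actions[int(np.argmax(Q[i]))] for i, s in enumerate(states)}
print("Greedy policy:", policy)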

Conclusion

Reinforcement Learning is a powerful technique for training agents to make decisions by interacting with an environment. It has applications in various fields such as robotics, game playing, and autonomous driving. While this tutorial covers the basics, there is much more to explore, including advanced algorithms like Deep Q-Learning and Policy Gradient methods.