
SARSA Reinforcement Learning

SARSA, a sophisticated tool in the fascinating world of artificial intelligence, assists computers in learning how to make sound decisions. Consider teaching a computer to play a game, operate a car, or manage resources - SARSA is similar to a manual that teaches the machine how to improve over time. In this article, we'll take a straightforward look at SARSA. We'll investigate how it works, what it's useful for, and how it makes our computers smarter by learning from their experiences. So, let's get started and enjoy the wonders of SARSA reinforcement learning!

What is SARSA?

Definition of SARSA:

SARSA is a reinforcement learning algorithm that teaches computers how to make good decisions by interacting with an environment. SARSA stands for State-Action-Reward-State-Action, which represents the algorithm's sequence of steps. It helps computers learn from their experiences to determine the best actions.

Explanation of SARSA:

Assume you're teaching a robot to navigate a maze. The robot begins at a specific location (the "State" - where it is), and you want it to discover the best path to the maze's finish. The robot can proceed in numerous directions at each step (these are the "Actions" - what it does). As it travels, the robot receives feedback in the form of rewards - positive or negative numbers indicating its performance.

The amazing thing about SARSA is that it doesn't need a map of the maze or explicit instructions on what to do. It learns by trial and error, discovering which actions work best in different situations. This way, SARSA helps computers learn to make decisions in various scenarios, from games to driving cars to managing resources efficiently.

SARSA and the Q-Learning Algorithm

SARSA (State-Action-Reward-State-Action) and Q-learning are popular reinforcement learning algorithms for solving similar sequential decision-making problems. They belong to the family of temporal difference (TD) learning methods and are used to find optimal policies for an agent to maximize cumulative rewards over time. However, there are some critical differences between SARSA and Q-learning:

Policy Type:

  • SARSA: SARSA is an on-policy algorithm, which means it learns and updates its Q-values based on the policy it is currently following. This makes SARSA well-suited for scenarios where the agent interacts with the environment while following its learning policy.
  • Q-learning: Q-learning is an off-policy algorithm that learns and updates Q-values based on the optimal policy, regardless of the policy the agent is currently following. Q-learning often requires an exploration strategy to ensure it explores different actions.

Learning Update:

  • SARSA: The SARSA algorithm updates the Q-values based on the current state, the action taken, the reward received, the next state, and the next action chosen using the same policy. It directly incorporates the next action's Q-value in its update rule.
  • Q-learning: Q-learning updates the Q-values based on the current state, the action taken, the reward received, and the maximum Q-value of the next state over all possible actions. It assumes that the agent will follow the greedy (optimal) policy in the future.
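To make the difference concrete, here is a minimal sketch contrasting the two update targets for a single observed transition. The Q-table entries, transition, and hyperparameters are all made-up illustrative values, not taken from any real task:

```python
import numpy as np

# Toy Q-table: 5 states x 2 actions, with made-up values for the next state.
Q = np.zeros((5, 2))
Q[2] = [0.8, 0.2]                 # illustrative Q-values for state S' = 2
alpha, gamma = 0.5, 0.9

s, a, r, s_next = 0, 1, 1.0, 2    # one observed transition (S, A, R, S')
a_next = 1                        # the action the policy actually takes in S'

# SARSA (on-policy): bootstrap from the action actually chosen next.
sarsa_target = r + gamma * Q[s_next, a_next]       # 1.0 + 0.9 * 0.2 = 1.18

# Q-learning (off-policy): bootstrap from the best available next action.
qlearning_target = r + gamma * np.max(Q[s_next])   # 1.0 + 0.9 * 0.8 = 1.72

q_sarsa = Q[s, a] + alpha * (sarsa_target - Q[s, a])
q_qlearning = Q[s, a] + alpha * (qlearning_target - Q[s, a])
```

When the policy happens to pick the greedy action, the two targets coincide; they diverge exactly when the agent explores a non-greedy action.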

Exploration Strategy:

  • SARSA: Since SARSA is an on-policy algorithm, it uses the same exploration strategy to select the current and subsequent actions. It typically uses strategies like epsilon-greedy to balance exploration and exploitation.
  • Q-learning: Q-learning's off-policy nature allows for a more flexible exploration strategy. The agent can use a more exploratory policy while updating its Q-values based on the optimal policy.

Convergence Speed and Stability:

  • Q-learning: Q-learning tends to converge to the optimal policy more efficiently in many cases, especially when the exploration strategy is well-designed and the exploration is thorough.
  • SARSA: SARSA can be more stable in some scenarios, as it updates Q-values based on the policy it follows. However, this can also lead to slower convergence if exploration is not managed correctly.


Applicability:

  • Both algorithms apply to various problems and have been used in robotics, gaming, finance, and more.

In summary, SARSA and Q-learning are related in that they both aim to solve reinforcement learning problems by learning optimal policies through interactions with the environment. However, they differ in their approach to learning, exploration, and the type of policy they update. The choice between SARSA and Q-learning often depends on the specific characteristics of the problem and the trade-offs between on-policy and off-policy learning.

Working of SARSA

The SARSA (State-Action-Reward-State-Action) reinforcement learning algorithm enables agents to learn and make decisions in an environment by maximizing cumulative rewards over time. It follows an iterative process of interacting with the environment, learning from experiences, and refining its decision-making policy. Let's break down how SARSA works step by step:


Initialization:

  • The process begins by initializing the Q-values for all possible state-action pairs. Q-values represent the agent's estimate of the expected cumulative reward it can achieve by taking a particular action in a specific state.
  • These initial Q-values can be set to arbitrary values, zeros, or any other appropriate initialization method.

Action Selection:

  • The agent starts in an initial state (S) of the environment. This state represents the current situation the agent is in.
  • The agent chooses an action (A) based on the current state and policy. The policy guides the agent's decision by specifying the probability or likelihood of selecting each action in the given state.
  • Common policy strategies include:
    • Epsilon-Greedy: Exploits the best action (highest Q-value) most of the time but occasionally explores other actions at random.
    • Softmax: Chooses actions randomly, with higher-valued actions more likely (probabilities proportional to the exponentials of the Q-values).
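Both selection strategies can be sketched in a few lines. The Q-values below are made-up placeholder numbers for a single state:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_action(q, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.exp((q - np.max(q)) / temperature)   # shift by max for stability
    return int(rng.choice(len(q), p=prefs / prefs.sum()))

q_values = np.array([0.1, 0.5, 0.2, 0.2])  # Q-values for one state (made up)
greedy_pick = epsilon_greedy(q_values, epsilon=0.0)  # epsilon=0 always exploits
```

Lowering the temperature makes the softmax policy behave more greedily; raising epsilon makes the epsilon-greedy policy explore more.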

Taking Action and Receiving Reward:

  • After selecting an action, the agent executes it in the current state.
  • The environment responds by providing a reward (R) to the agent. This reward reflects the immediate benefit or penalty associated with the chosen action.
  • The agent transitions to a new state (S') due to taking the action. The environment evolves, and the agent moves to a different situation.

Next Action Selection:

  • In the new state (S'), the agent uses the same policy to select the next action (A') to take. This step ensures that SARSA is an on-policy algorithm, meaning it learns and updates Q-values based on the policy it is currently following.
  • The agent's policy may choose the next action based on past experiences or its exploration strategy.

Updating Q-values:

  • SARSA learns by updating its Q-values based on the experiences it gains from interacting with the environment.
  • The agent calculates the TD (Temporal Difference) error: the difference between the current estimate Q(S, A) and the learning target, which is the observed reward plus the discounted Q-value of the next state-action pair (R + gamma * Q(S', A')).
  • The Q-value of the current state-action pair is then moved toward this target, scaled by a learning rate (alpha). The update rule can be expressed as:

Q(S, A) ← Q(S, A) + alpha * (R + gamma * Q(S', A') - Q(S, A))

Where gamma (γ) is the discount factor that balances the importance of immediate and future rewards.


Iteration and Convergence:

  • The agent continues this iterative process of interacting with the environment, selecting actions, receiving rewards, updating Q-values, and refining its policy.
  • The algorithm repeats for a predefined number of episodes or until a convergence criterion is met.

Over time, as the agent explores the environment and refines its Q-values based on interactions, it learns to make better decisions in different states. The policy becomes more refined, and the agent's actions align with those that maximize cumulative rewards. SARSA allows the agent to navigate complex environments and optimize its decision-making strategies for various applications through learning and adaptation.

Understanding Components of SARSA

  1. State (S): At the core of SARSA is the concept of "state." A state represents a snapshot of the environment, encapsulating all relevant information about the current situation the agent finds itself in. This information could include factors such as the agent's location, the conditions of the surroundings, available resources, or any other pertinent details that influence the agent's decision-making process.
  2. Action (A): Embedded within the SARSA algorithm is the notion of "action." An action signifies the choice the agent makes based on its current state. The agent's selected maneuver triggers a transition from the current state to a subsequent state. This transition is the agent's way of interacting with the environment, attempting to move forward in its pursuit of achieving favorable outcomes.
  3. Reward (R): The concept of "reward" forms a crucial aspect of the SARSA algorithm. A reward is a scalar value the environment provides in response to the agent's chosen action within a given state. This feedback signal indicates the immediate benefit or consequence of the agent's decision. Rewards guide the agent's learning process, indicating the desirability of particular actions in specific states.
  4. Next State (S'): As the agent takes action in a particular state, it triggers a transition to a new situation known as the "next state." This subsequent state (S') is the agent's updated environment after it has executed its chosen action. The interaction between the agent, the chosen action, the current state, and the ensuing next state is a fundamental progression that drives the SARSA algorithm's learning process.

Applications of SARSA

Game Playing:

  • SARSA can train agents to play games effectively by learning optimal strategies. In board games like chess, it can explore different move sequences and adapt its decisions based on rewards (winning, drawing, losing).
  • SARSA can control game characters in video games, making them learn to navigate complex levels, avoid obstacles, and interact with other in-game entities.


Robotics:

  • SARSA is invaluable for robotic systems. Robots can learn how to move, interact with objects, and perform tasks through interactions with their environment.
  • SARSA can guide a robot in exploring and mapping unknown environments, enabling efficient exploration and mapping strategies.

Autonomous Vehicles:

  • Self-driving cars can use SARSA to learn safe and efficient driving behaviors. The algorithm helps them navigate various traffic scenarios, such as lane changes, merging, and negotiating intersections.
  • SARSA can optimize real-time decision-making based on sensor inputs, traffic patterns, and road conditions.

Resource Management:

  • In energy management, SARSA can control the charging and discharging of batteries in a renewable energy system to maximize energy utilization while considering varying demand and supply conditions.
  • It can optimize the allocation of resources in manufacturing processes, ensuring efficient utilization of machines, materials, and labor.

Finance and Trading:

  • SARSA can be applied in algorithmic trading to learn optimal buying and selling strategies in response to market data.
  • The algorithm can adapt trading decisions based on historical market trends, news sentiment, and other financial indicators.


Healthcare:

  • In personalized medicine, SARSA could optimize treatment plans for individual patients by learning from historical patient data and adjusting treatment parameters.
  • SARSA can aid in resource allocation, such as hospital bed scheduling, to minimize patient wait times and optimize resource utilization.

Network Routing:

  • Telecommunication networks can benefit from SARSA for dynamic routing decisions, minimizing latency and congestion.
  • SARSA can adapt routing strategies to optimize data transmission paths based on changing network conditions.


Education:

  • SARSA can adapt content and learning paths based on student performance data in educational platforms.
  • It can provide personalized learning experiences by recommending appropriate study materials and activities to individual students.

Supply Chain Management:

  • SARSA can optimize inventory management, dynamically adjusting reorder points and quantities to balance inventory costs and stockouts.
  • It can optimize transportation routes and delivery schedules to minimize shipping costs and ensure timely order fulfillment.

Industrial Automation:

  • In manufacturing, SARSA can optimize the scheduling of tasks on assembly lines, considering factors like machine availability and production efficiency.
  • It can control robotic arms for tasks like pick-and-place manufacturing operations.

Natural Language Processing:

  • SARSA can enhance dialogue systems or chatbots by learning to generate appropriate responses based on user interactions.
  • It can improve user experiences by adapting responses to user preferences and context.

Recommendation Systems:

  • SARSA can learn user preferences from historical interactions to make personalized products, services, or content recommendations.
  • It adapts recommendations over time based on user feedback and evolving preferences.

SARSA contributes to intelligent decision-making in these applications, enabling agents to learn and adapt strategies that optimize desired outcomes over time. Its ability to learn from experiences and improve decision-making makes it a versatile tool across diverse domains, leading to enhanced efficiency, automation, and user experiences.

Benefits of SARSA

SARSA (State-Action-Reward-State-Action) reinforcement learning algorithm has several distinct advantages, making it a valuable tool for solving sequential decision-making problems in various domains. Here are some of its key advantages:

On-Policy Learning:

SARSA is an on-policy learning algorithm, which means it updates its Q-values based on the policy it is currently following. This has several advantages:

  • Stability: SARSA's on-policy nature often leads to more stable learning. Since it learns from experiences generated by its policy, the updates align with the agent's actions, resulting in smoother and more consistent learning curves.
  • Real-Time Adaptation: On-policy algorithms like SARSA are well-suited for online learning scenarios where agents interact with the environment in real-time. This adaptability is crucial in applications such as robotics or autonomous vehicles, where decisions must be made on the fly while the agent is in motion.

Balanced Exploration and Exploitation:

SARSA employs exploration strategies, such as epsilon-greedy or softmax policies, to balance the exploration of new actions and exploitation of known actions:

  • Exploration: SARSA explores different actions to discover their consequences and learn the best strategies. This is essential for learning about uncertain or unexplored aspects of the environment.
  • Exploitation: The algorithm uses its current policy to exploit actions leading to higher rewards. This ensures that the agent leverages its existing knowledge to make optimal decisions.

Convergence to Stable Policies:

The combination of on-policy learning and balanced exploration contributes to SARSA's convergence to stable policies:

  • Smooth Learning: SARSA converges more smoothly than off-policy algorithms, especially in environments with complex dynamics. The agent learns gradually, progressing steadily toward an optimal policy without drastic fluctuations.

Sequential Decision-Making:

SARSA excels in scenarios involving sequential decision-making:

  • Long-Term Consequences: SARSA considers both immediate rewards and potential future rewards in its learning process. This makes it well-suited for scenarios where actions have delayed consequences, and the agent must plan for the long term.

Easily Adaptable to Different Environments:

SARSA's flexibility enables its application in diverse environments:

  • Model-Free Learning: SARSA is model-free, meaning it doesn't require explicit knowledge of the environment's dynamics. This applies to environments with complex, unknown, or stochastic behavior.

Incremental Learning:

SARSA learns incrementally, which offers several advantages:

  • Data Efficiency: The agent learns from limited data, making it suitable for scenarios where gathering extensive data is resource-intensive or time-consuming.
  • Real-Time Updates: SARSA can update its policy based on real-time interactions, enabling dynamic learning and adaptation.

Less Susceptible to Value Overestimation:

SARSA is generally less prone to overestimation of Q-values compared to other algorithms:

  • Accurate Action Values: This leads to a more accurate estimation of action values, contributing to better decision-making and policy optimization.

Applicability in Real-World Domains:

The practical benefits of SARSA's advantages extend to real-world applications:

  • Robust Decision-Making: In domains like robotics, finance, and autonomous systems, SARSA's ability to learn and adapt over time results in improved decision-making and performance.

SARSA's unique combination of on-policy learning, balanced exploration, and adaptability makes it a valuable tool for tackling complex decision-making challenges in diverse domains. Its advantages contribute to more stable learning, better convergence, and enhanced decision-making, making it a preferred choice in reinforcement learning.

Disadvantages of SARSA

While SARSA (State-Action-Reward-State-Action) has many advantages, it also has limitations and disadvantages. Let's explore some of these drawbacks:

  1. On-Policy Learning Limitation:
    • While advantageous in some scenarios, SARSA's on-policy learning approach can also be a limitation. It means that the algorithm updates its Q-values based on its current policy. This can slow down learning, especially in situations where exploration is challenging or when there's a need to explore more diverse actions.
  2. Exploration Challenges:
    • Like many reinforcement learning algorithms, SARSA can struggle with exploration in environments where rewards are sparse or delayed. It might get stuck in suboptimal policies if it fails to explore sufficiently to discover better strategies.
  3. Convergence Speed:
    • SARSA's convergence speed might be slower compared to off-policy algorithms like Q-learning. Since SARSA learns from its current policy, exploring and finding optimal policies might take longer, especially in complex environments.
  4. Bias in Value Estimation:
    • SARSA can be sensitive to initial conditions and early experiences, leading to potential bias in the estimation of Q-values. Biased initial Q-values can influence the learning process and impact the quality of the learned policy.
  5. Efficiency in Large State Spaces:
    • SARSA's learning process might become computationally expensive and time-consuming in environments with large state spaces. The agent must explore a substantial portion of the state space to learn effective policies.
  6. Optimality of Policy:
    • SARSA may fail to converge to the optimal policy, mainly when exploration is limited or when the optimal policy is complex and difficult to approximate.
  7. Difficulty in High-Dimensional Inputs:
    • SARSA's tabular representation of Q-values might be less effective when dealing with high-dimensional or continuous state and action spaces. Function approximation techniques would be needed to handle such scenarios.
  8. Trade-off Between Exploration and Exploitation:
    • SARSA's exploration strategy, like epsilon-greedy, requires tuning of hyperparameters, such as the exploration rate. Finding the right balance between exploration and exploitation can be challenging and impact the algorithm's performance.
  9. Sensitivity to Hyperparameters:
    • SARSA's performance can be sensitive to the choice of hyperparameters, including the learning rate, discount factor, and exploration parameters. Fine-tuning these parameters can be time-consuming.
  10. Limited for Off-Policy Tasks:
    • SARSA is inherently an on-policy algorithm and might not be the best choice for tasks where off-policy learning is more suitable, such as scenarios where learning from historical data is essential.

Despite these limitations, SARSA remains a valuable reinforcement learning algorithm in various contexts. Its disadvantages are often addressed by combining it with other techniques or by selecting appropriate algorithms based on the specific characteristics of the problem at hand.

Implementation in Python

Below is a basic implementation of the SARSA (State-Action-Reward-State-Action) reinforcement learning algorithm in Python. This example demonstrates how SARSA can train an agent to navigate a simple grid world environment. Please note that this is a simplified version for educational purposes and might require further optimization for more complex scenarios.


In this section, we define the grid world environment. The grid represents a simple environment where the agent needs to navigate from the start state (1) to the goal state (2) while avoiding obstacles (-1). The numbers represent different states in the grid.
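The original listing is not reproduced here; a plausible reconstruction of the grid definition could look like this (the exact 4x4 layout and obstacle positions are assumptions):

```python
import numpy as np

# 4x4 grid world: 1 = start, 2 = goal, -1 = obstacle, 0 = free cell.
# (The layout is an assumption; the original listing is not shown.)
grid = np.array([
    [1,  0,  0,  0],
    [0, -1,  0,  0],
    [0,  0, -1,  0],
    [0,  0,  0,  2],
])
n_rows, n_cols = grid.shape
num_states = n_rows * n_cols   # states are indexed 0..15, row-major
```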


Here, we define the possible actions and initialize the Q-values. Q-values represent the expected cumulative rewards for each state-action pair. We initialize the Q-values to zeros for all state-action pairs.
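A sketch of this step, assuming four movement actions and a flattened 4x4 grid of 16 states:

```python
import numpy as np

num_states = 16                # 4x4 grid, flattened row-major (assumed)
actions = ["up", "down", "left", "right"]

# Q-table: expected cumulative reward for each (state, action) pair,
# initialized to zero before any learning takes place.
Q = np.zeros((num_states, len(actions)))
```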


These are the hyperparameters for the SARSA algorithm. Alpha is the learning rate, gamma is the discount factor, and epsilon is the exploration rate.
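Typical values for these hyperparameters might look like the following (the specific numbers are illustrative, not the original settings):

```python
alpha = 0.1        # learning rate: step size of each Q-value update
gamma = 0.9        # discount factor: weight of future vs. immediate rewards
epsilon = 0.1      # exploration rate for the epsilon-greedy policy
num_episodes = 1000
```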


This section implements the SARSA algorithm. It iterates through several episodes (num_episodes) to train the agent. In each episode, the agent starts from the initial state (start state) and follows the SARSA update rules to learn the optimal policy. The agent explores actions using an epsilon-greedy policy and updates its Q-values based on rewards and next actions.
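A self-contained sketch of this training loop is shown below. The grid layout, reward scheme (+1 at the goal, a small per-step cost elsewhere), and helper functions are illustrative assumptions, since the original listing is not reproduced:

```python
import numpy as np

# Assumed 4x4 grid: 1 = start, 2 = goal, -1 = obstacle, 0 = free cell.
grid = np.array([[1,  0,  0,  0],
                 [0, -1,  0,  0],
                 [0,  0, -1,  0],
                 [0,  0,  0,  2]])
n_rows, n_cols = grid.shape
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right
Q = np.zeros((n_rows * n_cols, len(moves)))
alpha, gamma, epsilon, num_episodes = 0.1, 0.9, 0.1, 1000
start, goal = 0, 15                               # flattened state indices
rng = np.random.default_rng(0)

def step(state, action):
    """Apply a move; walls and obstacles leave the state unchanged."""
    r, c = divmod(state, n_cols)
    dr, dc = moves[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < n_rows and 0 <= nc < n_cols and grid[nr, nc] != -1:
        r, c = nr, nc
    nxt = r * n_cols + c
    if nxt == goal:
        return nxt, 1.0, True        # reaching the goal pays +1
    return nxt, -0.04, False         # small per-step cost otherwise

def choose_action(state):
    """Epsilon-greedy selection from the current Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(len(moves)))
    return int(np.argmax(Q[state]))

for _ in range(num_episodes):
    state, action = start, choose_action(start)
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = choose_action(next_state)   # on-policy: pick A' first
        # SARSA update: Q(S,A) <- Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A))
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                     - Q[state, action])
        state, action = next_state, next_action

print("Learned Q-values:\n", Q)
```

Because the Q-value of the terminal goal state is never updated, it stays at zero, which gives the correct target of R alone on the final transition of each episode.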



Learned Q-values:
[[-0.71902719 -0.72055885 -0.71906261 -0.73588847]
 [-0.63504374  0.          0.36656018 -0.72767427]
 [-0.66545015  0.82710218  0.         -0.69249097]
 [-0.82899255  0.          1.27985369  0.        ]]
Optimal Path:
[1, 5, 6, 10, 9, 5, 6, 7, 8, 4, 0, 1, 2, 3, 7, 11, 10, 14]

Finally, we evaluate the learned policy by starting from the initial state and selecting actions based on the learned Q-values. The agent follows the policy to reach the goal state (2), and the optimal path is printed.
