Deep Reinforcement Learning for Agents: Huggy and Doom

Awesome Examples of Pre-Generative AI Agents

If you’re not sure what AI agents are, this post is for you.

AI agents have been around long before LLMs and ChatGPT.

What is an AI agent?

  • an autonomous entity, operating without human intervention

  • that gets feedback from its environment

  • to make decisions and achieve goals


Familiar examples include:

  • Voice Assistants: Siri, Alexa, Google Assistant.

  • Gaming: Starcraft, Dota bots.

  • Chatbots: Customer service chatbots to solve customer issues.

These agents existed long before ChatGPT and LLMs, although they were not as flexible or adaptable.

In this post, I share awesome examples of pre-generative AI agents, explaining how they’re trained and how they work:

Here’s a YouTube version of this post:

Deep Reinforcement Learning

A popular method for training agents is deep reinforcement learning.

It enables agents to learn and make decisions in complex environments through trial and error: they learn directly from interactions with their environments, guided by the goal of maximizing expected reward.

As the name suggests, deep reinforcement learning combines:

  1. Reinforcement Learning

A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize rewards. The agent receives feedback in the form of rewards or penalties, helping it learn optimal behaviors over time.

For example, giving your dog a treat for good behavior teaches your dog to do more of that good behavior.

  2. Deep Learning

A subset of machine learning that uses neural networks with many layers (i.e. deep neural networks). In deep reinforcement learning, these networks approximate value functions or policies.
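To make the first ingredient concrete, here is a toy sketch (an invented example, not code from any library): a one-state "dog training" problem where a Q-learning-style update nudges the estimated value of each action toward the reward it earns.

```python
import random

# Toy sketch of the reinforcement-learning ingredient: two actions,
# where "sit" earns a treat (reward 1) and "bark" earns nothing.
# The update rule moves each action's estimated value toward the
# reward actually received.
random.seed(0)

actions = ["sit", "bark"]
q_values = {a: 0.0 for a in actions}   # estimated value of each action
alpha = 0.1                            # learning rate

def reward(action):
    return 1.0 if action == "sit" else 0.0

for step in range(500):
    action = random.choice(actions)    # try actions at random
    r = reward(action)
    # Value update (one state, so no discounted next-state term):
    q_values[action] += alpha * (r - q_values[action])

print(q_values)  # "sit" should be valued near 1.0, "bark" near 0.0
```

Deep reinforcement learning swaps the lookup table `q_values` for a deep neural network, which is what lets agents handle states far too numerous to tabulate.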

Key Components

Here are the key components of deep reinforcement learning:

  1. Agent: The learner or decision-maker.

  2. Environment: The external system with which the agent interacts.

  3. State: The current situation of the agent.

  4. Action: All possible moves the agent can take in a given state.

  5. Reward: Feedback from the environment based on the action taken.

  6. Policy: The strategy used to determine actions based on states.

  7. Value Function: A measure of how desirable the current state is.
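The components above can be mapped onto a minimal interaction loop. This is an invented illustration (a one-dimensional corridor), not an environment from the post:

```python
# Environment: a 1-D corridor; the agent starts at cell 0, the goal is cell 5.
class Corridor:
    def __init__(self):
        self.state = 0                     # State: the agent's position

    def step(self, action):
        # Action: -1 (move left) or +1 (move right)
        self.state = max(0, self.state + action)
        done = self.state == 5
        reward = 1.0 if done else -0.1     # Reward: feedback per step
        return self.state, reward, done

def policy(state):
    """Policy: strategy mapping state -> action (here, always go right)."""
    return +1

# Agent: the decision-maker running the policy against the environment.
# (A value function would additionally estimate how desirable each cell is.)
env = Corridor()
state, total_reward, done = 0, 0.0, False
while not done:
    action = policy(state)
    state, reward, done = env.step(action)
    total_reward += reward

print(total_reward)  # small step penalties, plus the bonus for reaching the goal
```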

How It Works

Here’s how the components all come together to train an AI agent:

  1. Exploration vs. Exploitation: the agent explores its environment to gather information (exploration) and uses that information to make decisions that maximize rewards (exploitation).

  2. Learning Process: through repeated interactions, the agent learns from rewards and penalties, adjusting its policy to improve performance.

  3. Neural Networks: used to approximate the policy or value function.
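The exploration-vs-exploitation trade-off is commonly handled with epsilon-greedy action selection: explore with probability epsilon, otherwise exploit the best-known action. A small sketch (invented values, for illustration only):

```python
import random

random.seed(1)

q_values = {"left": 0.2, "right": 0.8}     # current value estimates

def epsilon_greedy(q, epsilon):
    if random.random() < epsilon:
        return random.choice(list(q))      # explore: pick a random action
    return max(q, key=q.get)               # exploit: pick the best-known action

counts = {"left": 0, "right": 0}
for _ in range(10_000):
    counts[epsilon_greedy(q_values, epsilon=0.1)] += 1

# With epsilon = 0.1, "right" is chosen roughly 95% of the time:
# 90% exploitation, plus half of the 10% random exploration.
print(counts)
```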


Huggy

Huggy is an adorable project developed by Hugging Face, based on Puppo the Corgi by the Unity ML-Agents team.

The environment uses the Unity game engine and the ML-Agents toolkit, allowing the creation of environments to train agents.

In this case, Huggy learns to play fetch!


The primary goal in this environment is to train Huggy to fetch a stick.

To accomplish this, Huggy must move correctly towards the stick based on the information provided to him about his environment.

State Space

In reinforcement learning, the state space defines what the agent perceives.

Huggy can’t visually see his surroundings!

He only gets specific information:

  • Position of the target (stick)

  • Relative position between himself and the target

  • Orientation of his legs

Then, Huggy uses his policy to determine the next best actions.

Action Space

The action space is the set of all possible moves Huggy can take.

Huggy's movements are controlled by joint motors that drive his legs.

The action space consists of the rotations of Huggy's joint motors: Huggy learns to rotate the joint motors of each leg to move toward the stick.

Reward Function

The reward function reinforces desirable behaviors and penalizes undesirable behaviors.

Here are the components of the reward function:
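The components themselves aren't reproduced above, so here is a hedged sketch: in Hugging Face's Deep RL course, Huggy's reward is described as combining an orientation bonus, a time penalty, a rotation penalty, and a reward for reaching the target. The weights below are made-up placeholders, not the real configuration:

```python
def huggy_reward(facing_target, reached_target, spin_speed):
    """Illustrative reward combining the four components (invented weights)."""
    reward = 0.0
    if facing_target:
        reward += 0.01               # orientation bonus: heading toward the stick
    reward -= 0.005                  # time penalty: get there quickly
    reward -= 0.01 * spin_speed      # rotation penalty: discourage spinning
    if reached_target:
        reward += 1.0                # big payoff for actually fetching the stick
    return reward

print(huggy_reward(facing_target=True, reached_target=True, spin_speed=0.0))
```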

Training Huggy

To train Huggy, we teach him to run efficiently towards the stick.

At each step in time, Huggy must:

  1. Observe the environment

  2. Decide how to rotate each joint motor, without spinning

The training environment contains multiple copies of the scene, each with a stick that spawns at a random location.

When Huggy reaches the stick, it respawns elsewhere, providing diverse experiences and speeding up the training process.

Here’s Huggy learning to play fetch!

Try it yourself:


Doom

Here’s another example of AI agents trained via deep reinforcement learning, not exactly adorable but still very cool:

Teaching an AI agent to survive Doom without any prior knowledge.

The agent only knows:

  • life is desirable

  • death is undesirable

It must learn how to stay alive, recognizing that health is required for survival.

Eventually, the agent learns to collect health packs in order to survive.

This is built with VizDoom, an open-source Python library for training AI agents to play Doom using only visual information.

VizDoom enables training directly from screen pixels.

In this example, our Doom AI agent plays the Health Gathering level.

Proximal Policy Optimization (PPO)

This agent is trained with a technique called Proximal Policy Optimization (PPO).

Remember, policy refers to the strategy an AI agent employs to determine its next actions.

In traditional policy optimization, large updates to the policy can destabilize training, leading to poor performance.

PPO ensures policy updates are gradual and controlled.

It prevents drastic changes that could derail learning, keeping all updates within a safe range (known as “clipping”).

This added stability leads to more reliable learning and better performance in complex environments.
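The clipping idea can be sketched directly from PPO's standard clipped objective (this is the textbook formula, not code from the post): the ratio between the new and old policy's probability for an action is clipped to [1 - eps, 1 + eps], so a single update can only move the policy so far.

```python
def ppo_clipped_objective(new_prob, old_prob, advantage, eps=0.2):
    """Per-action PPO clipped surrogate objective (standard formulation)."""
    ratio = new_prob / old_prob
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the minimum: the more pessimistic (conservative) estimate wins.
    return min(ratio * advantage, clipped * advantage)

# A large policy jump (ratio 2.0) on a positive-advantage action is
# capped at 1.2 * advantage instead of 2.0 * advantage:
print(ppo_clipped_objective(new_prob=0.6, old_prob=0.3, advantage=1.0))
```

In a real trainer this objective is averaged over a batch of experience and maximized with gradient ascent on the policy network's parameters.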

Environment Setup

Our AI agent’s objective is to learn how to survive.

But at the start, it doesn’t know what will help it survive.

Over time, the AI agent must learn that health is required for survival and medical kits (aka “medkits”) replenish health.

Here’s the environment:

  • Map: A rectangular space enclosed by walls, with a hazardous green acid floor that periodically damages the agent.

  • Medkits: Initially scattered uniformly across the map, with additional medkits appearing intermittently. These medkits restore portions of the agent’s health, essential for survival.

  • Episode End: The simulation ends when the agent dies or after a timeout period.

Configuration Details

  • Reward: 1 point for living, incentivizing survival.

  • Penalty: 100 points deducted for dying, teaching the agent not to repeat the actions that led to its demise.

  • Action Space:

    • Turn left

    • Turn right

    • Move forward

  • Game Variable: Health, which the agent learns is connected to living.

The agent must navigate this environment, utilizing medkits to mitigate health loss from the acidic floor, making strategic decisions to prolong survival.
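The reward bookkeeping above can be sketched as a toy simulation. The +1 per step alive and -100 on death come from the configuration; the health and medkit numbers below are invented for illustration and are not VizDoom's actual values:

```python
def run_episode(medkit_steps, acid_damage=8, medkit_heal=40,
                start_health=100, max_steps=200):
    """Simulate one episode; returns (total reward, steps survived)."""
    health, total_reward = start_health, 0.0
    for step in range(max_steps):
        health -= acid_damage                 # acid floor ticks away health
        if step in medkit_steps:
            health = min(100, health + medkit_heal)
        if health <= 0:
            total_reward -= 100.0             # death penalty
            return total_reward, step
        total_reward += 1.0                   # survival reward per step
    return total_reward, max_steps

# Never grabbing a medkit ends the episode early with a negative return;
# grabbing one every few steps survives the full episode:
print(run_episode(medkit_steps=set()))
print(run_episode(medkit_steps=set(range(5, 200, 5))))
```

This is exactly the credit-assignment problem the agent faces: nothing in the reward mentions medkits, yet the policy that maximizes reward is the one that learns to collect them.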

Here’s the Doom AI agent learning to survive!

Last Thoughts

If you enjoyed this post, I would love to hear from you!

Just hit reply and let me know if you’d like to see more posts about AI agents, including generative AI agents and multi-agent systems!