Evader-Agnostic Team-Based Pursuit Strategies in Partially-Observable Environments

Here, the level-1 evader defeats the level-0 pursuer team. The level-0 pursuer team is trained against a naive A* evader and has not learned to respond to strategic evasive maneuvers. About 5 seconds into the top two videos, the evader (red trail) ducks under a thick group of trees as soon as it sees the high-level pursuer (HLP) and exits on the other side. When the videos pause, the top-left video shows the path the naive evader would have taken (denoted in pink); had the evader followed this non-evasive path, the HLP would almost certainly have seen it, increasing its likelihood of capture.

Abstract

In this paper, we consider a scenario in which a team of two unmanned aerial vehicles (UAVs) pursues an evader UAV within an urban environment. Each agent has a limited view of the environment, as buildings can occlude its field of view. Additionally, the pursuer team is agnostic about the evader: it does not know the evader's initial or final location, nor its behavior. Consequently, the team must gather information by searching the environment for the evader and then track it in order to eventually intercept it. To solve this multi-player, partially-observable, pursuit-evasion game, we develop a two-phase neuro-symbolic algorithm centered around the principle of bounded rationality. First, we devise an offline approach using deep reinforcement learning to progressively train adversarial policies for the pursuer team against fictitious evaders. This creates $k$-levels of rationality for each agent in preparation for the online phase. Then, we employ an online classification algorithm to determine a "best guess" of the current opponent from the set of iteratively-trained strategic agents and apply the corresponding best response. Using this scheme, we improve average performance against a random evader in our environment.
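The online phase described above lends itself to a simple sketch: infer the opponent's rationality level from its observed actions, then deploy the pursuer policy that was trained offline as the best response to that level. The Python snippet below is a minimal illustration only; the names evader_models, pursuer_policies, action_probability, and act are hypothetical stand-ins for the $k$-level agents and their interfaces, not the actual code or API of this work.

import numpy as np

class LevelKSelector:
    """Online-phase sketch: classify the opponent's rationality level from
    observed behavior, then act with the matching pre-trained best response.
    The model/policy interfaces here are assumptions for illustration."""

    def __init__(self, evader_models, pursuer_policies):
        # evader_models[k]: predictive model of the level-k fictitious evader
        # pursuer_policies[k]: pursuer policy trained offline against level k
        self.evader_models = evader_models
        self.pursuer_policies = pursuer_policies
        self.log_likelihood = np.zeros(len(evader_models))

    def update(self, observation, evader_action):
        # Accumulate how well each fictitious evader explains the observed action.
        for k, model in enumerate(self.evader_models):
            p = model.action_probability(observation, evader_action)
            self.log_likelihood[k] += np.log(p + 1e-8)

    def act(self, observation):
        # "Best guess" of the current opponent, then its best-response policy.
        k_hat = int(np.argmax(self.log_likelihood))
        return self.pursuer_policies[k_hat].act(observation)

In use, update would be called whenever the pursuers observe an evader action, and act would be queried at each pursuer decision step; before any sighting, the uniform log-likelihoods simply default the team to one of the pre-trained policies.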

Environment

Challenges

Methodology

Results